Ben Fanson
Simeon Lisovski

Workshop Structure
1) 10 weeks meeting weekly
   a) Tuesday, 10am, green room
2) Sessions are broken up into 2 parts
   a) Lecture part (20 - 45min)
   b) Hands on section (20 - 45min)
3) Informal setting (ask questions...)

Our background
Ben Fanson
- Masters in Statistics
- converted to R (2 years ago) from SAS
Simeon Lisovski
- R programmer for years
- has his own R package

Workshop Goals
1) Development of an analysis workflow
2) Become proficient in R fundamentals for data analysis

Disclaimer
1) This is NOT a statistics workshop
2) There are many, many ways to do the same thing in R. Our goal is that you know at least one way.
3) We do not focus on programming efficiency
4) This workshop is a work in progress...

Lecture 1 Goals
1) convince you that you should care
2) overview of data analysis workflow
3) Why R?
4) R editors (Rstudio)
5) Getting started with R

Workflow: Reasons to Care
1) Research more reproducible
2) Efficiency
3) Trend towards submitting code

Reasons to Care: Efficiency
- ~80% of my data analysis is data cleaning, restructuring, subsetting, and visualizing
- Re-running analyses after finding errors, applying a new transformation, or adding more data
- Reviewers requesting a specific analysis
- Easier to go back to old code and figure out your process

Reasons to Care: R scripts and journals
- Lendvai et al. 2013, Proc Roy Soc [figure: R code, R graphs]
- Earn et al. 2014, Proc Roy Soc: Population-level fever ...(another 43 pages)

Analysis Workflow
1) Acquire/store Data
2) Data Preparation: Data Cleaning, Reformatting, Queries/merges
3) Analysis: Statistical Methods, Assess output, Visualize output, Tabular Summaries
4) Write-up: Tables, Figures, Reports, Datasets

Attributes of a good workflow
- Transparency: organized, logical, well documented (e.g. commented code, structured programs, organized folders)
- Modularity: keep scripts simple (not too many tasks per script), have re-usable functions in one location (e.g. file with project functions)
- Portability: make it easy to share scripts with collaborators and reviewers (e.g. relative pathnames, nested folders)

Analysis Workflow: Acquire/store Data

Data management
Lots of systems available
Text files
Spreadsheets (e.g Excel)
Relational databases

Data management: Good practices
1) Raw data should be read-only
   a) Data should match lab notebook exactly (e.g. do not remove outliers/suspected errors)
2) Separate tables for different response components (e.g. morphology, behavior, physiology)
   a) Keep tables simple (helps prevent data entry errors)
3) Every data row should have a unique identifier
   a) Minimize repeating of information (keeps tables simpler and makes it less likely to misspell/mis-enter info)
4) Keep track of all data files associated with a project and have a short description of what data they contain
5) Decide on naming conventions before collecting data
   a) e.g. male vs. m vs. Male; Bactrocera tryoni vs. BT vs. B. tryoni (inconsistencies can be fixed with string functions in R, but consistent names make life easier)
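
The string-function fix mentioned in point 5 could look something like the sketch below. The example values, the object names, and the lookup table are made up for illustration; the functions used (trimws(), tolower(), substr()) are base R.

```r
# Hypothetical raw entries showing inconsistent naming
sex_raw     <- c('male', 'M', 'Male ', 'm', 'female', 'F')
species_raw <- c('Bactrocera tryoni', 'BT', 'B. tryoni')

# Standardise sex: trim whitespace, lower-case, keep only the first letter
sex_clean <- substr(tolower(trimws(sex_raw)), 1, 1)

# Standardise species with a lookup table (a named character vector)
species_lookup <- c('bactrocera tryoni' = 'Bactrocera tryoni',
                    'bt'                = 'Bactrocera tryoni',
                    'b. tryoni'         = 'Bactrocera tryoni')
species_clean <- unname(species_lookup[tolower(trimws(species_raw))])

print(sex_clean)
print(species_clean)
```

Any spelling missing from the lookup table would come back as NA, which is a handy flag for entries that still need attention.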

Data management: Text files
Pros
- Free
- Small file size
- No formatting
- Flexibility in output
Cons
- Very difficult for entering data
- Too much flexibility
- No quality control
- Mainly for computer output

Data management: Spreadsheets
Pros
- Easy to get to software
- Medium file size
- Tabular structure
- Easy for entering data
Cons
- Columns are not linked across a row
- Columns can contain any data type
- Information associated with formatting cannot be transferred
- Encourages non-reproducible restructuring
- Large datasets can be a mess
- No quality controls

Data management: Databases
Pros
- Tables are defined rigorously
- Columns can have only one data type
- A row is a single record and cannot be broken up
- Tables are linked, enforcing data integrity
- Data entry forms can be created
- Quality controls
Cons
- Steep learning curve
- Large file size
- Software can be more difficult to get
- Data entry can be more difficult than spreadsheet if trying to enter into table (however, creating a form is a great way to enter data)
Example: Rcourse_project.accdb

Analysis Workflow: Data Preparation, Analysis, Write-up

Why R and not other software
1) Complete analysis software
2) Free and open source
3) Developed by (and used by) statisticians
4) Popular (lots of contributions)
   (Note - Negative of this is that most packages are written by nonprogrammers so R code is often inefficient, inconsistent, and error messages are less than informative)
5) R is a functional programming language
6) Lots of data types (e.g. image, spatial/GIS, genetics)
R vs. SAS vs. SPSS vs. MATLAB (for a comparison, see http://stanfordphd.com/Statistical_Software.html)
[Figure: Number of Packages Over Time (>4500 packages)]
[Figure: Ranking of Statistics Software by use]

GUIs/Editors
1) R console [default editor] (not recommended)
2) Rstudio [we will use this one]
3) TinnR
4) Vim-R
5) Emacs
6) Others
http://stackoverflow.com/questions/1173463/recommendations-for-windows-text-editor-for-r

Rstudio overview
Complete R interface
a) Editor with colour coding and tab autofill/help pages
b) Integrated plots
c) R console
d) History
e) Version control (e.g. git or subversion)
f) Variable list

Rstudio example

Getting Started with R: Rprofile.site, .Rprofile
These files are sourced on startup of R
- Rprofile.site is sourced first (e.g. in C:\Program Files\R\R-3.1.0\etc\)
- .Rprofile in Local (project location) is then sourced

Getting Started with R: what to put in Rprofile.site, .Rprofile
- Packages you commonly use [ e.g. library(ggplot2) ]
- I source my list of R functions (stored in My Documents, which is included in my data backups)
- For some more tips, see
  http://stackoverflow.com/questions/1189759/expert-r-users-whats-in-your-rprofile
  http://www.r-bloggers.com/customize-your-rprofile-and-keep-your-workspace-clean/

R scripts: Setting up your Script
1. Header describing file
2. Global variables
   a) e.g.
      db_dir <- 'C:/'
      db_file <- 'database.accdb'  # I keep the database location and file name here for use with my personal access function
3. Directories [best to use relative pathname]
   a) setwd()  # set the working directory
   b) getwd()  # get current working directory
   c) list.files(), dir()  # get a list of files in a directory
4. Source file with functions for project [if you did not use .Rprofile]
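
As a rough illustration of the four parts listed above, a script might start like the sketch below. The file names, folder names, and the functions file are placeholders chosen for this example, not files that the workshop provides.

```r
# ---------------------------------------------------------------
# 1. Header
# File:    lecture1.r (hypothetical example script)
# Purpose: show the suggested layout of an R script
# ---------------------------------------------------------------

# 2. Global variables (placeholder values)
db_dir  <- 'data'
db_file <- 'database.accdb'

# 3. Directories: build paths relative to the project folder
db_path <- file.path(db_dir, db_file)
print(getwd())    # where R is currently working
print(db_path)    # relative path to the database file

# 4. Source the file with project functions, if it exists
fun_file <- file.path('R programs', 'project_functions.r')  # hypothetical file
if (file.exists(fun_file)) source(fun_file)
```

Because the paths are relative, the same script should run unchanged on a collaborator's machine as long as the project folder has the same structure.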
[Figure: example script layout showing Header, Global settings, Section 1]

R Resources
Online
- Google
- Stackoverflow (general programming - stackoverflow.com)
- Cross validated (statistics - stats.stackexchange.com/)
- Quick-R (http://www.statmethods.net/)
Books (plenty out there to choose from; lots of redundancy)
- R Cookbook
- The R Book (the big book)
- Data Analysis and Graphics Using R
- A Beginner's Guide to R (by Zuur)
(see http://www.r-bloggers.com/r-programming-books-updated/ for more suggestions)

Next Week
R Core Concepts
- R as matrix language
- R objects and classes
- Importing data

Lecture 1: Hands on Section

Setup directories for Rcourse_proj
1) create folder '/Rcourse_proj' [pick wherever you want]
2) create folder '/Rcourse_proj/R programs'
3) create folder '/Rcourse_proj/data'

Create Rstudio project
1) Open Rstudio
2) Create new project associated with directory /Rcourse_proj

Initialize .Rprofile
Create .Rprofile - in Windows, this is slightly harder than it should be. Open Rcourse_proj and paste the following code:
    sink('.Rprofile')  # redirect output to a new file called .Rprofile
    sink()             # stop redirecting; the empty file now exists in your working directory (make sure directory is Rcourse_proj)
Add the following text to .Rprofile (open the file in a text editor to add it):
    printResult <- function(){ print('Yes, this worked') }
    printResult()
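
As one possible alternative to the two steps above, the file can also be created and filled from R in one go with writeLines(). This is only a sketch: it writes into a temporary folder (the proj_dir name is an assumption) so that nothing in your real project is overwritten, and it sources the file by hand to mimic what R does at startup.

```r
# Work in a temporary folder so no real project files are touched
proj_dir <- file.path(tempdir(), 'Rcourse_proj')
dir.create(proj_dir, showWarnings = FALSE)

# Write the two lines of the .Rprofile directly from R
profile_path <- file.path(proj_dir, '.Rprofile')
writeLines(c("printResult <- function(){ print('Yes, this worked') }",
             "printResult()"),
           con = profile_path)

# Sourcing the file manually imitates what happens when R starts up
source(profile_path)
```

In a real project you would point profile_path at the project folder instead and restart R to check that the message is printed.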

Create an R script
Make a new R script
Save as 'R programs/lecture1.r'

Load a library/package
First install the package from CRAN (R website)...
    install.packages('ggplot2')  # can use Tools>Install packages... in Rstudio
Next load the package...
    library(ggplot2)  # or require(ggplot2)
    example(ggplot)   # see that it works

Basic R aspects
1) Comments are indicated by #
    # this is a comment and will not be run
2) <- (or =) assigns an object
    x <- 10  # or x = 10
    y <- 'test'
    z <- c(1,2,3,4)  # c() concatenates all the items, making a vector
3) Operators [ +, -, *, /, ^ ]
    x + x
    (total <- x * x)  # parentheses print out the result
    x^2
4) Attributes of an object [will talk more about this next week]
    class(x)
    str(x)  # difference will become apparent when we work with lists
5) Logical
    x <- 5; y <- 'test'      # make two variables
    x == 5                   # returns TRUE
    x < 1                    # returns FALSE
    x == 5 & y == 'test'     # returns TRUE
    x == 15 | y == 'test'    # returns TRUE
6) Functions have parentheses after the name
    mean(1:5)             # get the mean
    plot(x=1:10, y=1:10)  # scatterplot
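
The pieces above can be combined. The short sketch below applies a comparison to a whole vector and then passes the result to a function. The object names big and n_big are made up for this example.

```r
x <- 5; y <- 'test'
z <- c(1, 2, 3, 4)

big   <- z > 2      # comparison is applied to each element of z
n_big <- sum(big)   # TRUE counts as 1, so this counts the elements above 2

both   <- x == 5  & y == 'test'   # both conditions must hold
either <- x == 15 | y == 'test'   # at least one condition must hold

print(class(big))
print(n_big)
```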

Common R mistakes
1) '\' vs '/' in file paths
    Windows uses '\', R uses '/'
2) R is case sensitive: Mean() vs. mean()  # use autofill tab in Rstudio to help with this
3) '=' vs. '=='  # latter is comparing two items logically
    x = 5    # sets x to 5
    x == 10  # compares x to 10
4) When continuing onto the next line, end the line with ',' or an operator to indicate the line is not completed
    x <- c(1,1,1,1,
           0,0,0,0)  # this works
    x <- 1 +
         2           # this works
    x <- 1
         + 2         # does not work as x <- 1 is complete

Global attributes
    options()                         # get a list of current options
    options( stringsAsFactors=TRUE )  # will explain this one in next lecture

Directories
    setwd('R programs')  # set working directory to R programs
    getwd()              # see what is your working directory
    dir()                # list of contents of directory
    setwd('..')          # go up a level
    dir.create('query')  # create a new folder called query in current folder
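
To try the directory functions without touching your project, the sketch below runs them inside a temporary scratch folder. The folder name Rcourse_demo is an assumption made for this example, and the original working directory is restored at the end.

```r
old_wd <- getwd()                              # remember where we started
base   <- file.path(tempdir(), 'Rcourse_demo') # hypothetical scratch folder
dir.create(file.path(base, 'R programs'), recursive = TRUE, showWarnings = FALSE)

setwd(base)                                # move into the scratch folder
dir.create('query', showWarnings = FALSE)  # create a new folder called query
contents <- dir()                          # list what is in the folder now
print(contents)

setwd('R programs')                        # go down a level
setwd('..')                                # and back up again
same_place <- identical(normalizePath(getwd()), normalizePath(base))

setwd(old_wd)                              # restore the original working directory
```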

Some useful functions/objects
ls(), rm()
    ls()  # get a list of objects
    rm()  # delete one or more objects
rep(), seq()
    rep(x=5, times=10)         # repeat 5 ten times
    seq(from=0, to=1, by=0.1)  # make a sequence
LETTERS, letters
    LETTERS  # capitalized alphabet
    letters  # lowercase alphabet
month.abb, month.name
    month.abb   # list of abbreviated months
    month.name  # list of month names
pi
    pi  # 3.141593

Functions Learned
Functions: getwd(), setwd(), dir(), list.files(), dir.create(), options(), source(), library(), c(), class(), str(), ls(), rm(), rep(), seq()
Operators: *, /, +, -, ^, &, |
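
A few of the helper functions and built-in objects from this lecture can be tried together, as in the sketch below. The object names are made up, and paste0() is a base R function that was not covered in the lecture; it is used here only to glue letters and numbers together.

```r
fives <- rep(x = 5, times = 10)              # 5 repeated ten times
steps <- seq(from = 0, to = 1, by = 0.1)     # 0, 0.1, ..., 1

ids         <- paste0(LETTERS[1:3], 1:3)     # labels built from letters and numbers
first_three <- month.abb[1:3]                # first three abbreviated months

print(length(steps))
print(ids)
print(first_three)

rm(fives)                                    # delete an object
print('fives' %in% ls())                     # check that it is gone
```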