Phone: 844-448-1212 Web: http://www.rstudio.com
Garrett Grolemund
Data Scientist and Master Instructor
November 2015
Email: garrett@rstudio.com
Clarify: Become familiar with the data, template a solution
Develop: Create a working model
Productize: Automate and integrate
Publish: Socialize
R Markdown
R: A scripting language that is very useful for analyzing data.
Clarify: Become familiar with the data, template a solution
Develop: Create a working model
Productize: Automate and integrate
Scale Up*: Generalize to entire data set
Publish: Socialize
* sometimes
2015 RStudio, Inc. All rights reserved.
How do analysts use big data?
Analytic Big Data Problems
Class 1. Extract Data
Problems that require you to use all of the data at once. These
problems are irretrievably big; they must be run at scale within the
data warehouse.
[Figure: the workflow diagram (Clarify, Develop, Productize, Scale Up*, Publish), with "Class 2" marked beside Scale Up* and "Class 3" beside "Generalize to entire data set"]
[Figure: a data table containing missing values is passed to lm()]
stat-computing.org/dataexpo/2009/
Data does not fit in memory
Example Task
1. Collect random sample of training data
Load balancing
Collaborative editing
Administrative tools
management
Audit history
Metrics and monitoring
RStudio Desktop IDE
RStudio Server Open Source
RStudio Server Pro
dplyr
Package that provides data manipulation syntax for R. Comes with a built-in SQL backend that:
1. Connects to DBMSs
2. Transforms R code to SQL and sends it to the DBMS to run in the DBMS
3. Collects results into R
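The three steps above can be sketched against a throwaway in-memory SQLite database. This is only a minimal sketch under assumptions that are not in the slides: it connects through DBI and RSQLite rather than a src_ function, it assumes the dbplyr package is installed (current dplyr releases keep the SQL backend there), and the `flights` table and its values are made up for illustration.

```r
library(dplyr)

# 1. Connect to a DBMS (assumption: an in-memory SQLite database stands in
#    for the real data warehouse)
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

# Made-up rows, only so that the example has something to query
DBI::dbWriteTable(con, "flights", data.frame(
  uniquecarrier = c("AA", "AA", "NW", "US"),
  depdelay      = c(5, 300, 12, NA),
  arrdelay      = c(2, 280, 20, 15)
))

# 2. Write R code against a table reference; dplyr translates it to SQL
#    and no rows are pulled into R at this point
flights <- tbl(con, "flights")
short_delays <- flights %>%
  filter(depdelay < 240) %>%
  select(uniquecarrier, depdelay, arrdelay)

# 3. Collect the results into R as a local data frame
local_copy <- collect(short_delays)

DBI::dbDisconnect(con)
```

Only the rows that pass the filter cross from the database into R, which is the point of pushing the work to the DBMS.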
src_ functions: src_postgres, src_sqlite, src_mysql, src_bigquery
# Save the connection to an object to use later.
# Arguments are driver specific. For PostgreSQL these include dbname, host, and port.
db <- src_postgres(
  dbname = 'DATABASE_NAME',
  host = 'HOST',
  port = 5432
)
https://cran.r-project.org/web/packages/dplyr/vignettes/databases.html
%>%
mean( , na.rm = TRUE)
show_query
To see the SQL that dplyr will run:
show_query(clean)   # clean is a tbl made from a table reference
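To try show_query() without a live database, one option is a simulated backend. The sketch below assumes the dbplyr package is installed and uses its lazy_frame() and simulate_postgres() helpers. The column values are placeholders, and `clean` here is a hypothetical query, not the object from the talk.

```r
library(dplyr)

# Assumption: a simulated PostgreSQL connection, so no real database is needed.
# lazy_frame() creates a table reference that only exists to generate SQL.
flights <- dbplyr::lazy_frame(
  depdelay = 1, arrdelay = 1, year = 2008L,
  con = dbplyr::simulate_postgres()
)

# A hypothetical query built from the table reference
clean <- flights %>%
  filter(depdelay < 240) %>%
  select(arrdelay, depdelay, year)

# Print the SQL that dplyr would send to the database
show_query(clean)

# The same SQL captured as a character string
sql_text <- as.character(dbplyr::sql_render(clean))
```

Reading the generated SQL is a quick way to confirm that a filter or summary will run in the database rather than in R.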
Operators +, -, *, /, %%, ^
q3 <- filter(q2, depdelay < 240)
q4 <- select(q3, arrdelay, depdelay, year)
q4
When should dplyr query the database?
Lazy Execution 1
result as a tbl_df
# build model
mod <- lm(gain ~ depdelay + distance + uniquecarrier,
data = random)
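The sample-then-model pattern can be tried end to end with base R alone. In this sketch everything about the data is an assumption: `flights` is simulated, the carrier codes are arbitrary, and `gain` is generated from a made-up formula, since its definition is not shown in this section.

```r
set.seed(42)

# Simulated stand-in for the full data set (all values are made up)
n <- 10000
flights <- data.frame(
  depdelay      = rexp(n, rate = 1 / 20),
  distance      = runif(n, min = 100, max = 2500),
  uniquecarrier = sample(c("AA", "NW", "US"), n, replace = TRUE)
)
flights$gain <- -0.05 * flights$depdelay +
  0.002 * flights$distance +
  rnorm(n, sd = 5)

# 1. Collect a random sample of training data
random <- flights[sample(nrow(flights), 1000), ]

# build model on the sample only
mod <- lm(gain ~ depdelay + distance + uniquecarrier,
          data = random)
coef(mod)
```

Fitting on 1,000 rows keeps the model in memory; whether a sample of that size is adequate depends on the data and has to be checked case by case.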
RHadoop
PivotalR
SparkR
RODBC
rm(db)
gc()   # Close connection when finished
2. Calculate summaries manually in data store
[Figure: flight records with columns uniquecarrier, year, arrdelay, depdelay, distance are collapsed to one row per carrier with a count, e.g. NW 381213]

uniquecarrier  count
DH             123752
US             431913
AA             677471
F9             37710
HP             105926
AS             189748
AQ             5368
9E             38367
EV             326694
NW             381213

ggplot(clsummary) +
  geom_bar(aes(x = uniquecarrier,
               y = count),
           stat = "identity")
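A per-carrier count like the one plotted here can be computed inside the data store, so that only one row per carrier reaches R. The sketch below is one possible way to do it, under the same assumptions as any local demo: an in-memory SQLite database via DBI and RSQLite, the dbplyr package installed, and a handful of made-up rows in place of the real flights table.

```r
library(dplyr)

# Assumption: in-memory SQLite stands in for the real data store
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

# Made-up rows for illustration only
DBI::dbWriteTable(con, "flights", data.frame(
  uniquecarrier = c("AA", "AA", "NW", "US", "NW", "AA"),
  depdelay      = c(5, 30, 12, 0, 7, 1)
))

# group_by() and summarise() are translated to a SQL GROUP BY,
# so the counting happens in the database
clsummary <- tbl(con, "flights") %>%
  group_by(uniquecarrier) %>%
  summarise(count = n()) %>%
  collect()

DBI::dbDisconnect(con)
```

The collected `clsummary` is small enough to hand to a plotting function regardless of how many rows the underlying table holds.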
```{r cache=TRUE}
d <- flights %>%
  select() %>%
  filter() %>%
  mutate() %>%
  collect()
```
Big Data and R Markdown
2. Set engine to use another language if sensible.
asis, asy, awk, bash, c, cat, coffee, dot, fortran, gawk, groovy, haskell, highlight, lein, mysql, node, perl, psql, python, Rcpp, Rscript, ruby, sas, scala, sed, sh, stan, stata, tikz, zsh
Shiny and big data
Big Data and Shiny
1. Avoid unnecessary repetitions
[Figure: Instances; an Application Instance with three Workers; five users; many Updates per user]

ui <- fluidPage(
  sliderInput(inputId = "num",
    label = "Choose a number",
    value = 25, min = 1,
    max = 100),
  plotOutput("hist")
)

server <- function(input, output) {
  output$hist <- renderPlot({
    hist(rnorm(input$num))
  })
}

Code outside the server function will be run once per R worker.
Code inside the server function will be run once per connection.
Code inside of a
Delay reactions
eventReactive()
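The placement rules and eventReactive() can be combined in one small app. This is a sketch rather than the app from the talk: the `lookup` data frame, the `go` button, and the `sampled` reactive are hypothetical names added for illustration, and the app object is built but not launched.

```r
library(shiny)

# Outside the server function: runs once per R worker.
# `lookup` is a hypothetical stand-in for an expensive, shared object
# such as a large data set or a database connection.
lookup <- data.frame(x = rnorm(10000))

ui <- fluidPage(
  sliderInput(inputId = "num",
    label = "Choose a number",
    value = 25, min = 1,
    max = 100),
  actionButton(inputId = "go", label = "Update"),
  plotOutput("hist")
)

server <- function(input, output) {
  # Inside the server function: runs once per connection
  session_started <- Sys.time()

  # eventReactive() delays the reaction: the sample is redrawn only
  # when the button is clicked, not on every slider movement
  sampled <- eventReactive(input$go, {
    lookup$x[sample(nrow(lookup), input$num)]
  })

  # Re-renders only when sampled() produces a new value
  output$hist <- renderPlot({
    hist(sampled())
  })
}

# Build the app object; call runApp(app) to launch it
app <- shinyApp(ui = ui, server = server)
```

Keeping the expensive object outside the server function and gating the recomputation behind a button are two separate ways to avoid repeating work for each user.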