Intro to R for the SQL Server Pro

Luis Figueroa
Solutions Architect
BlueGranite, Inc
MCITP SQL Server 2008
Microsoft V-TSP

Email: lfigueroa@blue-granite.com
Twitter: @luisefigueroa
LinkedIn: https://www.linkedin.com/in/luisefigueroa
WHY ARE WE HERE TODAY?

• The increasing amount and complexity of data make it more and more
difficult to infer insights with simple data exploration techniques. Advanced
statistics are required to learn more from our data.

• The tremendous increase in and commoditization of computing power make it
possible to explore large amounts of detailed data.

• Huge amounts of resources are going into the development of advanced
algorithms for big data.

• Very mature statistical applications and algorithms are available.


“Data plus math and statistics only gets you machine learning” - Drew Conway

Drew Conway
http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram
The Data Science Process

Computational Information Design


by Ben Fry

http://benfry.com/phd/
Why R?

• R is Free

• Functional programming enables you to do more without having
to implement lower-level operations to execute tasks

• Great development environment

• Vast ecosystem of developers and thriving support community

• Azure ML supports R

• SQL Server 2016 will support R natively!
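
As a rough illustration of the functional programming point above, the sketch below uses only base R. The order amounts, the tax rate, and the `make_discount` helper are invented for this example and are not part of the original talk.

```r
# Hypothetical order amounts, invented for illustration
amounts <- c(120, 45, 300, 80)

# Apply a function to every element without writing an explicit loop
with_tax <- sapply(amounts, function(x) x * 1.07)

# Keep only the elements that satisfy a predicate (loosely, a WHERE clause)
large <- Filter(function(x) x > 100, amounts)

# Fold the vector into a single value (loosely, SUM())
total <- Reduce(`+`, amounts)

# Functions are values: build a new function from a parameter
make_discount <- function(rate) function(x) x * (1 - rate)
ten_off <- make_discount(0.10)
discounted <- ten_off(amounts)

print(large)
print(total)
```

Most R functions are also vectorized, so `amounts * 1.07` gives the same result as the `sapply` call; the explicit version is shown only to make the pattern visible.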


Overview and History

1970s - S is created for statistical analysis

1991 - R is created, based on the S language

1995 - R adopts the GNU General Public License, making it free software

1997 - The R Core Group is formed. This entity governs the evolution of the R language

2000 - Version 1.0 is released

2015 - R 3.2 (Full of Ingredients) hits general availability


Where do I start?

Download and install R:


http://cran.r-project.org/mirrors.html

Download RStudio:
http://www.rstudio.com/products/rstudio/

Revolution R Open
http://mran.revolutionanalytics.com/download/

All available for Windows, Linux, Unix, OS X


Start with the following packages
To load data
RODBC, RMySQL, RPostgreSQL, RSQLite - If you'd like to read in data from a database, these packages are
a good place to start. Choose the package that fits your type of database.
XLConnect, xlsx - These packages help you read and write Microsoft Excel files from R. You can also just
export your spreadsheets from Excel as .csv files.
foreign - Want to read a SAS data set into R? Or an SPSS data set? The foreign package provides functions
that help you load data files from other programs into R.
R can handle plain text files; no package is required. Just use the functions read.csv, read.table, and
read.fwf. If you have even more exotic data, consult the CRAN guide to data import and export.
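
A minimal sketch of the three base R readers named above. To stay self-contained it first writes small temporary files; the file contents, column names, and widths are invented for illustration.

```r
# Write three small hypothetical files to temporary paths so the example
# runs anywhere; the data is invented for illustration.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,name,amount", "1,Widget,9.99", "2,Gadget,19.50"), csv_path)

tsv_path <- tempfile(fileext = ".tsv")
writeLines(c("id\tname\tamount", "1\tWidget\t9.99", "2\tGadget\t19.50"), tsv_path)

fwf_path <- tempfile(fileext = ".txt")
writeLines(c("001Widget 9.99", "002Gadget19.50"), fwf_path)

# Comma-separated file with a header row
csv_df <- read.csv(csv_path, stringsAsFactors = FALSE)

# Tab-delimited file: read.table needs the header and separator spelled out
tsv_df <- read.table(tsv_path, header = TRUE, sep = "\t",
                     stringsAsFactors = FALSE)

# Fixed-width file: columns are defined by character widths (3, 6, 5)
fwf_df <- read.fwf(fwf_path, widths = c(3, 6, 5),
                   col.names = c("id", "name", "amount"),
                   stringsAsFactors = FALSE)

# Inspect column types, much like checking a table definition
str(csv_df)
```

Each call returns a data frame, which plays roughly the role a result set plays in SQL Server.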

To manipulate data
dplyr - Essential shortcuts for subsetting, summarizing, rearranging, and joining together data sets. dplyr is
our go-to package for fast data manipulation.
tidyr - Tools for changing the layout of your data sets. Use the gather and spread functions to convert your data
into the tidy format, the layout R likes best.
stringr - Easy to learn tools for regular expressions and character strings.
lubridate - Tools that make working with dates and times easier.
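
The sketch below shows how the four packages above might fit together, assuming all of them are installed. The `sales` data frame and its column names are invented for illustration, and the SQL comparisons in the comments are loose analogies rather than exact equivalents.

```r
library(dplyr)
library(tidyr)
library(stringr)
library(lubridate)

# Hypothetical sales data, invented for illustration
sales <- data.frame(
  region = c("east", "east", "west", "west"),
  order_date = c("2015-01-15", "2015-02-20", "2015-01-05", "2015-02-11"),
  amount = c(100, 150, 80, 120),
  stringsAsFactors = FALSE
)

monthly <- sales %>%
  mutate(
    region = str_to_upper(region),            # stringr: clean up text
    order_month = month(ymd(order_date))      # lubridate: parse date, take month
  ) %>%
  group_by(region, order_month) %>%           # dplyr: loosely GROUP BY
  summarise(total = sum(amount)) %>%          # dplyr: aggregate per group
  ungroup() %>%
  arrange(region, order_month)                # dplyr: loosely ORDER BY

# tidyr: one column per month (loosely a PIVOT), then back to long form
wide <- spread(monthly, key = order_month, value = total)
long <- gather(wide, key = "order_month", value = "total", -region)

print(wide)
```

`gather` and `spread` are the functions the slide names; newer tidyr releases also offer `pivot_longer` and `pivot_wider` for the same reshaping.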

More really useful packages:


https://support.rstudio.com/hc/en-us/articles/201057987-Quick-list-of-useful-R-packages
R 101 And Demo
Resources - Free Training

The R Project for Statistical Computing


http://www.r-project.org

RStudio - Online Learning


http://www.rstudio.com/resources/training/online-learning/

Datacamp - R & Data Science


https://www.datacamp.com/courses

TryR
http://tryr.codeschool.com

Johns Hopkins University - Data Science


https://www.coursera.org/specialization/jhudatascience/1?utm_medium=catalog
Resources - Community
Leada
https://www.teamleada.com/courses

Swirl
http://swirlstats.com

R-Bloggers
http://www.r-bloggers.com

Twitter - One R Tip a Day


@RlangTip

StackOverflow
http://stackoverflow.com/questions/tagged/r

Search Engine - RSeek.Org


http://rseek.org
Resources - Books and People

An Introduction to Statistical Learning (Intro - Start here)


http://www-bcf.usc.edu/~gareth/ISL/

The Elements of Statistical Learning (More Advanced)


http://statweb.stanford.edu/~tibs/ElemStatLearn/

Hilary Mason: https://twitter.com/hmason?lang=en

Nate Silver: https://twitter.com/NateSilver538?lang=en

Andrew Ng: https://twitter.com/AndrewYNg


Resources - Public Datasets
University of California, Irvine
http://archive.ics.uci.edu/ml/

Government Open Data Initiative


http://data.gov

Data Science Central


http://www.datasciencecentral.com/profiles/blogs/great-github-list-of-public-data-sets

GitHub - Awesome public datasets


https://github.com/caesar0301/awesome-public-datasets

Data Science Central - datasets search


http://www.datasciencecentral.com/page/search?q=data+sets

ImageNet
http://www.image-net.org
Resources - Infographics

Choosing R Vs Python Infographic


http://blog.datacamp.com/wp-content/uploads/2015/05/R-vs-Python-216-2.png

How to become a Data Scientist


http://www.r-bloggers.com/how-to-become-a-data-scientist-in-8-easy-steps-the-infographic/
Azure ML Algorithm Cheat Sheet

http://azure.microsoft.com/en-us/documentation/articles/machine-learning-algorithm-cheat-sheet/
Download today’s slides and project from:
https://github.com/luisefigueroa/Intro-to-R-for-the-SQL-Server-Pro

