
Using R in Azure Machine Learning

Take Azure ML to the next level with R

Alnis Bajars. alnis@bajars.com @alnisb


Agenda
Using R in Azure Machine Learning

R – Ecosystem Fundamentals

R – Selected Language Elements

Data Science Principles (including some lingo)

Azure ML Quick Overview

Azure ML + R
Assumptions
• We can’t do any one topic proper justice.
• So this talk will introduce the core ecosystem for your own follow up.
• No mathematical proofs.

Hopefully.
• You will know what you don’t know about Data Science
• Set expectations and realities about Data Science

These slides are meant to be used!


References
Coursera – R Programming (Johns Hopkins)
• Quality of explanations variable.
• But the practical assignments are good, and deadline driven.

Safari Books Online


• Video. Introduction to Data Science with R. Garrett Grolemund (R Studio).
• Most published books on R

Hadley Wickham (@hadleywickham) – Modern Godfather of R


• Chief Data Scientist at R Studio
• Author of influential R packages

Azure Machine Learning


• Intro video at Microsoft Virtual Academy
References continued
edX – Azure ML with R/ Python
• A number of demos driven by this content.
R – Ecosystem Fundamentals
R vs Python

R Pros
• More mature data science support (20 years +), purpose built
• More established ML support
• T-SQL integration in 2016 (out of scope)

Python Pros
• Best all round script language. Data science support improving.
• Better 64 bit support and scalability?

R performance and scalability – don’t forget Revolution Analytics


R Ecosystem – Essential Bits

Get R. https://cran.r-project.org/ (Revolution Analytics R not covered)

Get R Studio. https://www.rstudio.com/
• Essential IDE. But much more, packages, RPubs etc.
• Download R, then R Studio.

Get a Github account. https://github.com/ and Github shell.
• Distributed source code control system.
• Essential part of R social network.

RStudio a better environment for test and debug than Azure ML!
Github Lifecycle Cheat Sheet

From github.com
• Create repository (or fork someone else’s).

From local Github shell.

git clone <URL_of_repository>
cd <repository>
git add <files>
git commit -a -m "some_message"
git push
R Package Management

You’ll be doing this a lot!


To install a package at the command line.
install.packages("ggplot2") (multiple dependency options)

Or use R Studio.
R Package Install/ Reference

To use an installed package.
At the command line.
library("ggplot2")
R Studio code hint.

Can install libraries from Github (user/repository)


library("devtools")
install_github("ramnathv/rCharts")

Older versions of install_github have user and repository as separate arguments


R Visualisation Packages

plot
• Standard package. Easy to use but presentation ordinary.

lattice
• Enhanced package. Not very widely adopted.

ggplot (by Hadley Wickham) – Grammar of Graphics


• Best quality presentations yet easy to use
• Layers approach: ggplot
• Quickie version: qplot
qplot simple example
ggplot2 example inc Linear Model
ggplot2 … if you really want to get funky…
ggplot2 and the Boxplot

Concise way to show median, 1st/ 3rd quartiles, 1.5 * IQR and outliers.
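
A minimal sketch of the numbers behind a box plot. It uses base R rather than ggplot2 so it runs with no extra packages, and the built-in mtcars data set is just a convenient stand-in.

```r
# Fuel economy (miles per gallon) from the built-in mtcars data set.
x <- mtcars$mpg

# boxplot.stats() returns the five values a box plot draws:
# lower whisker, lower hinge (about the 1st quartile), median,
# upper hinge (about the 3rd quartile), upper whisker.
# Whiskers stop at the most extreme point within 1.5 * IQR of the box.
s <- boxplot.stats(x, coef = 1.5)
s$stats
s$out   # any points beyond the whiskers, drawn as outliers

# The plot itself, one box per cylinder count.
boxplot(mpg ~ cyl, data = mtcars, xlab = "Cylinders", ylab = "MPG")
```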
Scatter plot matrix and R pairs function

Concise way to show relationships between all features.
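
A short sketch of pairs on the built-in iris data set (chosen here for illustration only), with the matching correlation table.

```r
# Scatter plot matrix of the four numeric iris measurements,
# coloured by species (a factor, so as.integer gives 1, 2, 3).
pairs(iris[, 1:4], col = as.integer(iris$Species))

# The same relationships as pairwise correlations.
cors <- round(cor(iris[, 1:4]), 2)
cors
```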


R Data Wrangling Packages
dplyr
• Extensive function set for select/ sort/ filter/ derived columns/ group by/ top n.
• Note %>% operator to chain dplyr functions – pipeline like

tidyr (Hadley Wickham)


• Statisticians call cleansed data tidy data.
• Normalise/ denormalise.

sqldf
• Surprisingly good SQL syntax fidelity
R Dynamic Report Packages

knitr
• R Markdown + embedded R code => reports. HTML/ PDF/ LaTeX.
• Ideal platform for Reproducible Research.
• Demo. Properly cool.

shiny
• Interactive publishing of R driven web pages. Client and server bits.

slidify
• Generation of slide decks from R Markdown/ YAML/ R.
R – Selected Language Basics
R Fundamental Data Structures

Script language (Perl/ Python/ Ruby) data structures.


• Scalar
• Array
• Hash (key/value)

Contrast with R data structures


• Vector (a “scalar” is really a 1 element vector)
• Matrix (caveat – data of same type)

R is case sensitive everywhere! (Variables, functions etc.)


The data frame is an operational tabular structure, integral to data manipulation.
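
A small sketch of the structures named above. All object names are made up for illustration.

```r
# A "scalar" is just a vector of length 1.
x <- 42
length(x)

# Vector: an ordered set of values of one type.
v <- c(1.5, 2.5, 4)

# Matrix: two dimensions, still one type for every cell.
m <- matrix(1:6, nrow = 2)

# Data frame: a table whose columns may differ in type.
df <- data.frame(name = c("a", "b", "c"), value = v)

# Case sensitivity: the column is "value", so "Value" finds nothing.
df$value
df$Value   # NULL
```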
R Data Types
Atomic data types.
• character
• numeric (real numbers)
• integer
• complex
• logical (TRUE/FALSE)
typeof function handy
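
A quick sketch of typeof against each atomic type. Note that typeof reports numeric values as "double", R's storage type for real numbers.

```r
typeof("hello")   # "character"
typeof(3.14)      # "double"
typeof(2L)        # "integer" (the L suffix asks for an integer)
typeof(1 + 2i)    # "complex"
typeof(TRUE)      # "logical"

# A bare whole number is still stored as a double.
typeof(2)         # "double"

# The same checks collected into one character vector.
types <- sapply(list("hello", 3.14, 2L, 1 + 2i, TRUE), typeof)
```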
R Assignment and c function

Two different forms, <- and =, generally equivalent.


• The <- form most popular.

c for combine to build free form vectors.
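
A minimal sketch of both assignment forms and of c. The variable names are illustrative only.

```r
a <- 10          # the arrow form
b = 10           # the equals form, equivalent at top level
identical(a, b)  # TRUE

# c() combines values into a vector.
ages <- c(31, 45, 27)
more <- c(ages, 52, 60)   # vectors flatten when combined

# Mixed types are coerced to one common type, here character.
mixed <- c(1, "two", TRUE)
typeof(mixed)
```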


Reading Data and Missing Values

A number of functions to read data files (usually read.table).


• Generally into data frames.

How are values not entered handled?


• R default is NA
• This can be overridden (eg read.table argument na.strings)
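
A sketch of missing value handling on read. The inline text stands in for a file, and the -999 sentinel is a made-up example of a "not entered" marker.

```r
# With a real file, pass its path instead of text =.
csv <- "city,temp
Sydney,24
Perth,-999
Hobart,"

# Empty numeric fields become NA by default. na.strings lists
# further markers to treat as NA.
weather <- read.csv(text = csv, na.strings = c("NA", "-999", ""))

is.na(weather$temp)                 # FALSE TRUE TRUE
mean(weather$temp, na.rm = TRUE)    # 24
```

Most summary functions return NA if any input is NA, hence na.rm = TRUE.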
Looking at the data

A number of handy functions. (Factor – discrete values)
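
Which functions the slide showed is not recorded here, so the selection below is an assumption: a handful of commonly used base R inspection functions, run on the built-in iris data set.

```r
str(iris)              # column types at a glance; Species is a factor
head(iris, 3)          # first three rows
summary(iris)          # per column summaries; counts for factors
dim(iris)              # rows and columns
levels(iris$Species)   # the discrete values a factor can take
table(iris$Species)    # how often each level occurs
```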


R as a Functional Programming Language

In R, functions are 1st class objects. This is widely used.


Eg apply family of functions. apply, sapply, lapply
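
A small sketch of the three functions named above, plus a user defined function passed as an argument. The function name spread is made up for illustration.

```r
m <- matrix(1:6, nrow = 2)

# apply: run a function over rows (1) or columns (2) of a matrix.
row_sums <- apply(m, 1, sum)

# lapply: run a function over each element, always returning a list.
squares_list <- lapply(1:3, function(x) x^2)

# sapply: like lapply, but simplifies the result to a vector if it can.
squares <- sapply(1:3, function(x) x^2)

# Functions are ordinary objects: store one, then pass it around.
spread <- function(x) max(x) - min(x)
col_spread <- sapply(mtcars[, c("mpg", "hp")], spread)
```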
View command – R Studio Console

Needs no further introduction!


Data Science Principles
Some General Notes
Algorithms vs Data
• Lots of data tends to be more influential than choice of algorithm
• Data collection methodology is critical

Correlation implies Causation?


• No!

Outliers
• Extreme values well outside the norm. Eg Australia’s billionaires
• How are they handled? Depends.

Variable Types (affects Algorithm choice)


• Continuous, eg apartment price
• Discrete, eg species of Iris. Don’t forget R argument stringsAsFactors
Data Analysis Flowchart
Codebook and Interpretation
Codebook is what Statisticians call the document that gives
• Field spec of the data
• Details about the data collection

Reference to data set


• US NOAA storm database
http://www.ncdc.noaa.gov/stormevents/details.jsp?type=eventtype
Read and interpret the Codebook carefully
• Eg Time based issues, all weather events only recorded since 1/1/96
• Careful combining features, eg # fatalities + # injuries does
not make sense
Machine Learning – Predictive Types
Supervised Learning
• Train model based on past results, validate with test data
• Independent variables or features as predictors
• Label or dependent variable to predict.
• Eg predict house price based on size, # rooms etc

Unsupervised Learning
• No past results to train on, thus more difficult to evaluate
• Find patterns, often using clustering
• Eg Google News
Supervised Learning Experiments
Split available data into training and test samples
• Often training 70% as a rule of thumb
• Fit a model against the training data, with accuracy close to just right (neither under nor over fitted)
• Validate model against test set

Beware of.
• Underfitting. Not a convincing predictor.
• Overfitting. Too much fitting of errors/ outliers. Great fit of training
data, rubbish for other data sets.
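
A minimal sketch of a 70/30 split in plain R. The mtcars data set and the choice of wt and hp as features are assumptions for illustration.

```r
set.seed(42)   # make the random split repeatable

n <- nrow(mtcars)
train_idx <- sample(n, size = round(0.7 * n))
train <- mtcars[train_idx, ]
test  <- mtcars[-train_idx, ]

# Fit on the training rows only.
fit <- lm(mpg ~ wt + hp, data = train)

# Root mean squared error on each sample. A test error far above
# the training error is a warning sign of overfitting.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
rmse_train <- rmse(train$mpg, predict(fit, newdata = train))
rmse_test  <- rmse(test$mpg,  predict(fit, newdata = test))
```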
Experiment Types

At a very high level.


• Regression. Fit a mathematical (often linear) function to predict continuous
values.
• Classification. Predict discrete values.
• Clustering. Group data items based on similarity.
• Recommender.
• Anomaly Detection. Detect exception cases.
Feature Selection

Your training data has a lot of features. Should we use them all?
• No! Too many dimensions, too much noise.
• Punt collinear features, those with marginal value
• Combine features where it makes sense
• randomForest model to assess importance
• Stepwise elimination of features, R has step() function
• Be ruthless!
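
A sketch of stepwise elimination with step(). The mtcars data set is an assumption for illustration. step() compares models by AIC.

```r
# Start from a model that uses every other column as a feature.
full <- lm(mpg ~ ., data = mtcars)

# Backward elimination: repeatedly drop the feature whose removal
# most improves AIC, stopping when no removal helps.
reduced <- step(full, direction = "backward", trace = 0)

length(coef(full))   # 11: the intercept plus 10 features
formula(reduced)     # the features that survived
```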
Averages and Standard Deviation
How to do an average.
• Mean. Sum of observations / # of observations – outlier sensitive
• Median. Middle value
• Mode. Most common value, best for factors (categorical)

Spread of data.
• Variance is the sum of (Value – Mean) squared / # observations. Square to (a)
remove the sign (b) give more weight to values far from the mean.
• Take square root of variance to get Standard Deviation which brings
value in same scale as observations, thus commonly used.
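
A worked sketch of these measures on a made-up vector. Two things to watch: base R has no built-in statistical mode (mode() reports storage mode), and var() and sd() divide by n - 1 rather than n.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

mean(x)     # 5
median(x)   # 4.5

# Statistical mode: tabulate the values and pick the most frequent.
# stat_mode is a helper defined here, not a base R function.
stat_mode <- function(v) {
  counts <- table(v)
  names(counts)[which.max(counts)]
}
stat_mode(x)   # "4" (returned as a character label)

# Variance dividing by the number of observations.
pop_var <- sum((x - mean(x))^2) / length(x)   # 4
pop_sd  <- sqrt(pop_var)                      # 2

# var() and sd() give the sample versions, dividing by n - 1.
var(x)   # 32 / 7
sd(x)
```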
Normalize Data/ R scale function

Features you want to compare naturally have different scales.


• Eg
• The bigger numbers will swamp small numbers in importance.

Solution? Scaling.
• Common solution is to normalize data to a scale where mean = 0 and
standard deviation = 1.

Note Azure ML has a Normalize Data module. R has a scale function.
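
A minimal sketch of the R scale function. The house figures are made up for illustration.

```r
# Two features on very different scales.
homes <- data.frame(price = c(450000, 620000, 380000, 910000),
                    rooms = c(2, 3, 2, 5))

# scale() subtracts each column's mean and divides by its
# standard deviation, returning a matrix.
z <- scale(homes)

round(colMeans(z), 10)   # both 0
apply(z, 2, sd)          # both 1
```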


Hypothesis Testing and Confidence Intervals
The protocol for hypothesis testing.
• Hypothesis 0 is the status quo.
• Hypothesis 1 is the alternative (eg new drug).
• Aim is to reject H0 in favour of H1 (or not)

The result is generally framed against a confidence level (via the p value).


• Commonly use 95%, a throwback to pre computer days.
• Controversy. The Earth is Round (p < 0.05)
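
A sketch of the protocol using a two sample t test, one common choice among many. The measurements are made up for illustration.

```r
# Made-up measurements for two groups.
placebo <- c(5.1, 4.9, 5.6, 5.8, 5.0, 5.3, 5.4, 5.2)
drug    <- c(5.9, 6.1, 5.7, 6.4, 6.0, 5.8, 6.3, 6.2)

# H0: both groups share the same mean. H1: the means differ.
result <- t.test(drug, placebo, conf.level = 0.95)

result$p.value    # chance of a gap at least this large if H0 were true
result$conf.int   # 95% confidence interval for the difference in means

# Reject H0 at the 95% level when the p value is below 0.05.
reject_h0 <- result$p.value < 0.05
```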
Tidy Data

Described by Hadley Wickham in


• Paper - http://vita.had.co.nz/papers/tidy-data.pdf
• Video - https://vimeo.com/33727555

Principles
1. Each variable forms a column.
2. Each observation forms a row.
3. Each type of observational unit forms a table.
Azure ML – Quick Overview
Azure ML – Get Started

What you need. Site is https://studio.azureml.net


• Azure account – does not have to be a trial, Machine Learning has a
free tier.
• Storage account.

* Trap. Storage account must be in same location as ML. Australia might not be available.
Azure ML – Flowchart
Azure ML Example – Compare Two Models
Azure ML Example – Prefab Data Wrangling
Azure ML – Re-use and Monetisation

Re-use via web services.


• REST APIs
• Code snippets in C#, R and Python.

Publish said web services to Azure Marketplace.


• Fairly involved diligence process including approvals.

Sadly, both topics out of scope.


Apply SQL Transformation Module

Use SQL syntax for data wrangling, based on SQLite.

I/O
• 3 input ports, internally use “tables” t1, t2 and t3
• 1 output port with results

Within Azure ML, an easier alternative to the R package sqldf.


Extend ML with R

Execute R Script
• Its own environment (avoid namespace collisions)
• Need to load packages
• Install new packages via zip

Execute R Script and I/O

3 input ports
• Dataset[12]; Azure table -> R data frame
• Script bundle; Zip -> code, objects, packages

2 output ports
• Results; R data frame -> Azure table
• R Device; stdout, stderr, graphics
Template code for Execute R Script
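
A sketch along the lines of the starter code in the Execute R Script module, not a copy of it. maml.mapInputPort and maml.mapOutputPort are the module's port functions. The stubs at the top are an assumption, added only so the sketch also runs locally, as is the Sepal.Area column.

```r
# Local stand-ins so the sketch runs outside Azure ML. Inside the
# module the real functions exist and this branch is skipped.
if (!exists("maml.mapInputPort")) {
  maml.mapInputPort  <- function(port) head(iris, 10)   # fake input table
  maml.mapOutputPort <- function(name) invisible(name)  # takes a variable name
}

# Input port 1 arrives as an R data frame.
dataset1 <- maml.mapInputPort(1)

# Any R wrangling goes here; this adds one derived column.
data.set <- dataset1
data.set$Sepal.Area <- data.set$Sepal.Length * data.set$Sepal.Width

# Anything printed or plotted goes to the R Device output port.
print(summary(data.set))

# Name the data frame to send to the Results output port.
maml.mapOutputPort("data.set")
```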
Execute R Script – a “real” example
Debugging R Code
What if code runs ok in RStudio but not in ML?

There is no debugger as such in ML, so


• Induce an error in R code, eg refer to an uninitialised object
• Right click R script module, select View Error Log
• Right click R script module, select View Output Log

Latter has more detail


Sample Output Log
Create Your Own R Library

Fairly mechanical.
• Create your own source function(s) in a .R file
• Zip up that file, with the name you want displayed in ML
• In ML, call Add Dataset to import file.
• Visible in My Datasets in ML.
Own R Library Example
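
A locally runnable sketch of the idea. In Execute R Script the zip attached to the script bundle port is unpacked under src/, so the call there would be source("src/myutils.R"). The file name myutils.R and the function to_camel_case are hypothetical.

```r
# Stand-in for the unpacked script bundle, built in a temp folder.
bundle_dir <- file.path(tempdir(), "src")
dir.create(bundle_dir, showWarnings = FALSE)
writeLines(
  "to_camel_case <- function(x) gsub('[ .]+', '', x)",
  file.path(bundle_dir, "myutils.R")
)

# Load the library functions into the session.
source(file.path(bundle_dir, "myutils.R"))

# Strip spaces and dots from column headers.
to_camel_case(c("Overall Height", "Glazing Area"))
```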
Create R Model Module
A module which includes model and scoring scripts
• Own R environment
• Only pre loaded R packages
• Only one output, no graphics

I/O
• Input. Training data frame
• Output. Model object.

Scripts
• Trainer script
• Scorer: uses R predict function
Sample R Model Module Code

Note most set and get functions local to R Model Module.


Sample training script.

Sample scoring script.


Loading R Packages into Azure ML

There are “only” 350 R Packages in Azure ML – you’ll eventually want to use other packages.

To load an R Package into Azure ML.


• Find the package and download as zip locally
• In ML Studio, select the big “+ NEW” option bottom LHS
• Select DATASET -> FROM LOCAL FILE
• Follow the bouncing ball
Using Loaded R Packages in Azure ML

Effectively need to install on each use in Execute R Script.
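
A hedged sketch of the install-then-load pattern. The package name mypkg and the zip name are hypothetical, and the src/ location assumes the zip is attached to the script bundle port. The file.exists guard is only there so the sketch is harmless when run elsewhere.

```r
# Hypothetical package and bundle names.
pkg_zip  <- "src/mypkg.zip"
pkg_name <- "mypkg"

if (file.exists(pkg_zip)) {
  # Install from the local file (repos = NULL) into the working
  # directory, then load from there. This runs on every execution.
  install.packages(pkg_zip, lib = ".", repos = NULL, verbose = TRUE)
  library(pkg_name, lib.loc = ".", character.only = TRUE)
  pkg_loaded <- TRUE
} else {
  message("Bundle not found; attach the zip to the script bundle port.")
  pkg_loaded <- FALSE
}
```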


Demos – CA Dairy Data

Really simple example of R, plus custom library in action.


Energy Efficiency Visualisation

Steps we take.
• Make Overall Height and Orientation categorical (what R calls Factors).
• Make all column headers CamelCase (remove spaces) to play nicer with R.
• Add R code to use dplyr to create derived columns for squares and cubes.
• Normalize Data for all numeric columns, transformation method MinMax. Mean 0
and standard deviation 1.
• Add R code to visualise data.
Energy Efficiency Visualisation continued

Now let’s do some data science!
• Project Columns module to punt a few columns.
• Use the Linear Regression, solution method Ordinary Least Squares.
• Split Data module – 60% training, 40% test
• Train Model module – Linear Regression plus Training data
• Permutation Feature Importance to score model against Test data
Energy Efficiency Visualisation – the score

The relative feature importance.


Summary

Please take this presentation as a call to action.

Alnis Bajars. Email: alnis@bajars.com Twitter: @alnisb
