
Data Analytics using R (DA-R)

Books
1. Lander, J. (2013). R for Everyone: Advanced Analytics and Graphics.
New Jersey: Addison-Wesley.
2. James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An
Introduction to Statistical Learning: with Applications in R. New
York: Springer-Verlag.
website: http://www-bcf.usc.edu/~gareth/ISL/

Internet Websites
http://www.r-bloggers.com/
https://stat.ethz.ch/mailman/listinfo/r-help
http://stackoverflow.com/questions/tagged/r
http://blog.revolutionanalytics.com/r/
http://chance.amstat.org/
http://www.statslife.org.uk/significance

Journals
The R Journal (web: https://journal.r-project.org/)
Journal of Statistical Software (web: http://www.jstatsoft.org/index)

Data Analytics
Analytics is the scientific process of transforming data into insight for
making better decisions.
Analytics is therefore employed for data-driven, or fact-based, decision
making.
By using analytics, managers can improve their decisions over time.

Key Questions addressed by Analytics


              Past                      Present                  Future

Information   What happened?            What is happening now?   What will happen?
              (Reporting)               (Alerts)                 (Extrapolation)

Insight       How and why did it        What's the next best     What's the best/worst
              happen?                   action?                  that can happen?
              (Modeling,                (Recommendation)         (Prediction, optimization,
              Experimental Design)                               simulation)

Source: Davenport, T. H., Harris, J. G., & Morison, R. (2010). Analytics at Work: Smarter Decisions, Better Results. Harvard Business Review Press.

Categorization of Analytical Methods and Models
1. Descriptive Analytics
2. Predictive Analytics
3. Prescriptive Analytics

Descriptive Analytics
Descriptive analytics consists of a set of techniques that describe what
has happened in the past.
Examples: Data Queries, Reports, Descriptive Statistics, Data
Visualization, etc.
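
The examples above can be sketched in a few lines of base R. This is only an illustration: the built-in mtcars data set and the 25 mpg cutoff are choices made here, not part of the course material.

```r
# Descriptive analytics on the built-in mtcars data set
# (used here only as a convenient illustration).
data(mtcars)

# Descriptive statistics: quartiles and mean of fuel economy (mpg).
mpg_summary <- summary(mtcars$mpg)
print(mpg_summary)

# Data query: cars that achieve more than 25 miles per gallon.
efficient <- mtcars[mtcars$mpg > 25, c("mpg", "cyl", "hp")]
print(efficient)

# Report: average mpg by number of cylinders.
mpg_by_cyl <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)
print(mpg_by_cyl)

# Data visualization: a text-based stem-and-leaf plot of mpg.
# hist(mtcars$mpg) would draw the graphical equivalent.
stem(mtcars$mpg)
```

All four steps only describe data that has already been collected; none of them predicts or recommends anything.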

Predictive Analytics
Predictive analytics comprises the set of techniques that use
models constructed from past data to predict the future or to study
the impact of one variable on another.
Examples: Linear Regression, Time Series Analysis, etc.

Prescriptive Analytics
Prescriptive analytics recommends the best course of action to take, i.e., the
output from a prescriptive analytics model is the best solution.
A common example is portfolio models in finance, which determine
the mix of investments that yields the highest expected return while
limiting the exposure to risk.
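
A toy version of the portfolio example, as a minimal sketch in base R. The two assets, their expected returns, volatilities, correlation, and the risk cap are all made-up numbers, and the simple grid search stands in for the optimization routines a real portfolio model would use.

```r
# Hypothetical inputs (made up for illustration): expected annual return,
# volatility (standard deviation), and correlation of two assets.
mu    <- c(stock = 0.10, bond = 0.04)
sigma <- c(stock = 0.20, bond = 0.05)
rho   <- 0.2
risk_cap <- 0.10   # largest portfolio standard deviation we accept

# Candidate portfolios: weight w in stocks, 1 - w in bonds.
w <- seq(0, 1, by = 0.01)
port_return <- w * mu[["stock"]] + (1 - w) * mu[["bond"]]
port_sd <- sqrt((w * sigma[["stock"]])^2 + ((1 - w) * sigma[["bond"]])^2 +
                2 * w * (1 - w) * rho * sigma[["stock"]] * sigma[["bond"]])

# Prescription: among portfolios within the risk cap,
# choose the one with the highest expected return.
feasible <- port_sd <= risk_cap
best <- which(feasible)[which.max(port_return[feasible])]

cat("Stock weight:", w[best],
    " expected return:", port_return[best],
    " risk:", port_sd[best], "\n")
```

The output is a decision (how much to put in each asset), which is what separates a prescriptive model from a predictive one.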

Data Set 1
Consider an Advertising data set consisting of the sales of a particular
product in different markets.
The data set also provides the advertising budgets for three different
media: TV, radio, and newspaper.
The goal is to find out which medium generates the biggest boost in sales.
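
One way to approach this goal is a linear regression of sales on the three budgets. The sketch below uses simulated data as a stand-in for the Advertising data set; the column names, budget ranges, and coefficients are assumptions made here for illustration.

```r
set.seed(1)
n <- 200

# Simulated stand-in for the Advertising data (all values are assumptions).
ads <- data.frame(
  TV        = runif(n, 0, 300),
  radio     = runif(n, 0, 50),
  newspaper = runif(n, 0, 100)
)
# Assumed truth: TV and radio raise sales, newspaper has no effect.
ads$sales <- 3 + 0.045 * ads$TV + 0.19 * ads$radio + rnorm(n, sd = 1.5)

# Regress sales on all three budgets at once.
fit <- lm(sales ~ TV + radio + newspaper, data = ads)

# Each slope estimates the change in sales per extra unit of budget
# in that medium, holding the other two budgets fixed.
print(summary(fit)$coefficients)
```

Comparing the estimated slopes and their standard errors indicates which medium is associated with the largest increase in sales per unit of budget.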

Data Set 2
Consider the Default data set, which provides information about
customers on the following variables:
Whether the customer has defaulted on his/her credit card payment
Annual income
Credit card balance
Student status

The objective is to predict whether a customer is going to default on
his/her credit card payment.
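
Because the response here is yes/no, logistic regression is a natural first model. The following is a minimal sketch on simulated data; the column names (balance, income, student), the sample size, and the assumed true relationship are illustrative and are not taken from the actual Default data.

```r
set.seed(2)
n <- 1000

# Simulated customers (all values are assumptions made for illustration).
customers <- data.frame(
  balance = pmax(rnorm(n, 800, 450), 0),
  income  = rnorm(n, 35000, 12000),
  student = factor(sample(c("No", "Yes"), n, replace = TRUE,
                          prob = c(0.7, 0.3)))
)
# Assumed truth: the probability of default rises with the balance.
p <- plogis(-8 + 0.006 * customers$balance)
customers$default <- rbinom(n, 1, p)

# Logistic regression: models the probability that default equals 1.
fit <- glm(default ~ balance + income + student,
           data = customers, family = binomial)
print(summary(fit)$coefficients)

# Predicted default probability for one hypothetical new customer.
new_customer <- data.frame(
  balance = 1800, income = 40000,
  student = factor("No", levels = c("No", "Yes"))
)
p_hat <- predict(fit, newdata = new_customer, type = "response")
print(p_hat)
```

A customer can then be classified as a likely defaulter when the predicted probability exceeds a chosen threshold such as 0.5.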

Data Set 3
Consider a customer database that contains a large number
of measurements (e.g., household income, occupation, distance from
nearest urban area, and so forth) for a large number of people.
The goal is to perform market segmentation by identifying subgroups of
people who might be more receptive to a particular form of
advertising, or more likely to purchase a particular product.
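
Market segmentation of this kind can be sketched with k-means clustering. The data below are simulated from two hypothetical segments; the variables, their units, and the choice of two clusters are assumptions made here for illustration.

```r
set.seed(3)

# Simulated customers drawn from two hypothetical segments.
income   <- c(rnorm(50, 40, 5), rnorm(50, 90, 8))   # household income (thousands)
distance <- c(rnorm(50, 5, 2),  rnorm(50, 30, 6))   # distance from urban area (km)
customers <- data.frame(income, distance)

# Standardize so that no variable dominates because of its units.
scaled <- scale(customers)

# k-means with 2 clusters; nstart = 20 tries several random starts
# and keeps the best solution.
km <- kmeans(scaled, centers = 2, nstart = 20)

# Segment sizes and the average profile of each segment.
print(table(km$cluster))
print(aggregate(customers, by = list(segment = km$cluster), FUN = mean))
```

No response variable is used anywhere: the segments are found from the measurements alone.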

Supervised Learning vs Unsupervised Learning
Supervised Learning: both X and Y are known
Unsupervised Learning: only X is known

Supervised Learning
Supervised learning is the setting in which both the predictor(s), X, and the
response, Y, are observed.
The main purpose is either to predict Y based on X or to understand the
relationship between X and Y.
Supervised learning problems can be further divided into regression
and classification problems based on the nature of Y: regression when Y
is quantitative, classification when Y is categorical.

Supervised Learning: Techniques


Linear Regression
Logistic Regression
Decision Trees
Support Vector Machines

Unsupervised Learning
Unsupervised learning is a set of statistical tools intended for the
setting in which we have only a set of features X1, X2, ..., Xp measured
on n observations.
We are not interested in prediction, because we do not have an
associated response variable Y.
The goal is to discover interesting things about the measurements on
X1, X2, ..., Xp.
Is there an informative way to visualize the data?
Can we discover subgroups among the variables or among the
observations?

Unsupervised Learning: Challenges


The exercise tends to be more subjective, and there is no simple goal
for the analysis, such as prediction of a response.
Unsupervised learning is often performed as a part of an exploratory
data analysis.
It can be very hard to assess the results obtained from
unsupervised learning methods, since there is no universally accepted
mechanism for validating results on an independent data set.

Unsupervised Learning: Techniques


1. Principal Component Analysis: a tool used for data visualization or
data pre-processing before supervised techniques are applied.
2. Clustering: a broad class of methods for discovering unknown
subgroups in data.
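
Both techniques are available in base R. The sketch below applies them to the built-in USArrests data set; the data set, the complete-linkage method, and the choice of three groups are illustrative assumptions.

```r
# Built-in data: arrest rates and urban population for 50 US states.
data(USArrests)

# 1. Principal component analysis on standardized variables.
pca <- prcomp(USArrests, scale. = TRUE)

# Proportion of the total variance explained by each component.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(var_explained, 3))

# Scores on the first two components: a 2-D view of 4-D data.
scores <- pca$x[, 1:2]
print(head(scores))

# 2. Hierarchical clustering on the same standardized data,
#    cut into 3 groups (the number 3 is an arbitrary choice here).
hc <- hclust(dist(scale(USArrests)), method = "complete")
groups <- cutree(hc, k = 3)
print(table(groups))
```

Plotting the two columns of scores against each other gives the kind of low-dimensional visualization mentioned above.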

Objectives of This Course


Learning a number of supervised and unsupervised learning
techniques.
Implementation of these techniques in R.

R: Some of the Best Features


Open Source Software, available on every major platform.
Massive set of packages for visualization, statistical modelling,
machine learning, and importing and manipulating data.
Readily available tools for data analysis.
A great community. Easy to get help from experts.
Readily available tools for communicating results.
Can connect to high-performance programming languages such as C, C++,
and Fortran.
(Check Advanced R by Hadley Wickham for more details)