
Data Analytics using R (DA-R)

1. Lander, J. (2013). R for Everyone: Advanced Analytics and Graphics.
New Jersey: Addison-Wesley.
2. James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An
Introduction to Statistical Learning: with Applications in R. New
York: Springer-Verlag.

Internet Websites

The R Journal
Journal of Statistical Software

Data Analytics
Analytics is the scientific process of transforming data into insight for
making better decisions.
Thus, analytics is employed for data-driven or fact-based decision-making.
By using analytics, managers can improve their decisions.

Key Questions addressed by Analytics

What happened?
What is happening?
What will happen?
How and why did it happen?
What's the next best action?
What's the best/worst that can happen?

Davenport, T. H., Harris, J. G., & Morison, R. (2010). Analytics at Work: Smarter Decisions, Better Results. Harvard
Business Review Press.

Categorization of Analytical Methods

1. Descriptive Analytics
2. Predictive Analytics
3. Prescriptive Analytics

Descriptive Analytics
Descriptive analytics consists of a set of techniques that describe what
has happened in the past.
Examples: Data Queries, Reports, Descriptive Statistics, Data
Visualization, etc.
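
The examples listed above can be sketched in a few lines of base R. This is
only an illustration: the built-in mtcars data set and the 25 mpg cut-off are
assumptions chosen for convenience, not part of the course material.

```r
# Descriptive analytics on R's built-in mtcars data (an illustrative choice).
data(mtcars)

# Data query: cars that achieve more than 25 miles per gallon
efficient <- subset(mtcars, mpg > 25, select = c(mpg, cyl, wt))

# Descriptive statistics: minimum, quartiles, mean and maximum of mpg
mpg_summary <- summary(mtcars$mpg)

# Report: mean mpg for each number of cylinders
mpg_by_cyl <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)

print(efficient)
print(mpg_summary)
print(mpg_by_cyl)

# Data visualization: distribution of mpg
hist(mtcars$mpg, main = "Miles per gallon", xlab = "mpg")
```

All four steps only summarize what is already in the data; nothing is
predicted or optimized.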

Predictive Analytics
Predictive analytics comprises a set of techniques that use
models constructed from past data to predict the future or to study
the impact of one variable on another.
Examples: Linear Regression, Time Series Analysis, etc.

Prescriptive Analytics
Prescriptive analytics recommends the best course of action to take, i.e., the
output of a prescriptive analytics model is the best solution.
A common example is portfolio models in finance, which determine
the mix of investments that yields the highest expected return while
limiting the exposure to risk.
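
A minimal sketch of the portfolio idea, assuming only two assets and a simple
grid search over the mix. All numbers (expected returns, volatilities,
correlation, risk limit) are hypothetical, and real portfolio models would
typically use a proper optimizer rather than a grid.

```r
# Hypothetical inputs for a two-asset portfolio (illustrative values only)
mu_stock <- 0.10   # expected annual return of the stock fund
mu_bond  <- 0.04   # expected annual return of the bond fund
sd_stock <- 0.20   # volatility of the stock fund
sd_bond  <- 0.05   # volatility of the bond fund
rho      <- 0.2    # assumed correlation between the two funds
max_sd   <- 0.10   # risk limit: portfolio volatility may not exceed this

# Candidate mixes: w is the share invested in the stock fund
w <- seq(0, 1, by = 0.01)

# Expected return and volatility of each candidate mix
port_return <- w * mu_stock + (1 - w) * mu_bond
port_sd <- sqrt((w * sd_stock)^2 + ((1 - w) * sd_bond)^2 +
                2 * w * (1 - w) * rho * sd_stock * sd_bond)

# Prescription: the feasible mix with the highest expected return
feasible <- port_sd <= max_sd
best <- which(feasible)[which.max(port_return[feasible])]

cat(sprintf("Stock share: %.2f, expected return: %.4f, volatility: %.4f\n",
            w[best], port_return[best], port_sd[best]))
```

The output is a recommended action (a specific mix), which is what
distinguishes a prescriptive model from a descriptive or predictive one.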

Data Set 1
Consider an Advertising data set consisting of the sales of a particular
product in different markets.
The data set also provides the advertising budgets for three different
media: TV, radio, and newspaper.
The goal is to find out which media generate the biggest boost in sales.
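
One way to approach this question is a linear regression of sales on the
three budgets. The sketch below does not use the real Advertising data: it
simulates a stand-in data set, and the column names and the coefficients used
to generate sales are assumptions made purely for illustration.

```r
# Simulated stand-in for the Advertising data (NOT the real data set)
set.seed(1)
n <- 200
advertising <- data.frame(
  TV        = runif(n, 0, 300),
  radio     = runif(n, 0, 50),
  newspaper = runif(n, 0, 100)
)

# Assumed data-generating process: newspaper has no effect on sales
advertising$sales <- 3 + 0.05 * advertising$TV + 0.20 * advertising$radio +
  rnorm(n, sd = 1.5)

# Fit a linear regression of sales on the three advertising budgets
fit <- lm(sales ~ TV + radio + newspaper, data = advertising)

# Coefficient table: estimate, standard error, t value and p-value per medium
print(summary(fit)$coefficients)
```

Each estimated coefficient is the expected change in sales for a one-unit
increase in that budget, holding the other two budgets fixed.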

Data Set 2
Consider the Default data set, which provides information about
customers on the following variables:
Whether the customer has defaulted on his/her credit card payment
Annual income
Annual credit card information
Student status

The objective is to predict whether a customer is going to default on
his/her credit card payment.
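
Because the response is a yes/no outcome, this is a classification problem;
logistic regression is one common choice. The sketch below simulates a
stand-in for the Default data: the column names (income, card_balance,
student, default) and the way defaults are generated are assumptions, not the
real data.

```r
# Simulated stand-in for the Default data (NOT the real data set)
set.seed(2)
n <- 1000
customers <- data.frame(
  income       = rnorm(n, mean = 35000, sd = 12000),
  card_balance = pmax(rnorm(n, mean = 800, sd = 450), 0),
  student      = factor(sample(c("No", "Yes"), n, replace = TRUE,
                               prob = c(0.7, 0.3)))
)

# Assumed mechanism: the probability of default rises with the card balance
p_true <- plogis(-6 + 0.005 * customers$card_balance)
customers$default <- factor(ifelse(runif(n) < p_true, "Yes", "No"),
                            levels = c("No", "Yes"))

# Logistic regression: models the probability that default == "Yes"
fit <- glm(default ~ income + card_balance + student,
           data = customers, family = binomial)

# Predicted probabilities, turned into class labels with a 0.5 cut-off
prob <- predict(fit, type = "response")
pred <- ifelse(prob > 0.5, "Yes", "No")

# Confusion table of predicted versus actual outcomes
print(table(predicted = pred, actual = customers$default))
```

The 0.5 cut-off is a convention rather than a requirement; a lower threshold
flags more customers as likely defaulters at the cost of more false alarms.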

Data Set 3
Consider a customer database that contains a large number
of measurements (e.g., household income, occupation, distance from
nearest urban area, and so forth) for a large number of customers.
The goal is to perform market segmentation by identifying subgroups of
people who might be more receptive to a particular form of
advertising, or more likely to purchase a particular product.
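
A minimal sketch of segmentation with k-means clustering, assuming two
numeric measurements per household. The data are simulated with two
deliberately well-separated groups, and the variable names, group sizes and
the choice of two clusters are all illustrative assumptions.

```r
# Simulated household measurements (hypothetical, two well-separated groups)
set.seed(3)
households <- rbind(
  data.frame(income = rnorm(50, mean = 30, sd = 5),    # income in thousands
             distance = rnorm(50, mean = 40, sd = 5)), # km to nearest urban area
  data.frame(income = rnorm(50, mean = 80, sd = 5),
             distance = rnorm(50, mean = 5, sd = 2))
)

# Standardize so that both variables contribute equally to the distances
scaled <- scale(households)

# k-means with 2 clusters; nstart = 20 tries several random starting points
segments <- kmeans(scaled, centers = 2, nstart = 20)

# Size of each segment and its average profile on the original scale
print(table(segments$cluster))
print(aggregate(households, by = list(segment = segments$cluster), FUN = mean))
```

No response variable is used anywhere: the segments come from the
measurements alone, and the number of clusters has to be chosen by the analyst.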

Supervised Learning vs Unsupervised Learning

Supervised Learning: both X and Y are known
Unsupervised Learning: only X is known

Supervised Learning
Supervised Learning is where both the predictor(s), X, and the
response, Y, are observed.
The main purpose is either to predict Y based on X or to understand the
relationship between X and Y.
Supervised learning problems can be further divided into regression
and classification problems based on the nature of Y.
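
One common way to write this down, following the formulation in James et al.
(2013), is shown below. Here X denotes the predictor(s) and Y the response;
the symbols f (the unknown systematic relationship), the error term, and K
(the number of classes) are notation introduced for this sketch and do not
appear on the slide.

```latex
Y = f(X) + \varepsilon,
\qquad
\begin{cases}
Y \in \mathbb{R} & \text{regression (quantitative response)} \\
Y \in \{1, \dots, K\} & \text{classification (qualitative response with $K$ classes)}
\end{cases}
```

For example, predicting sales from advertising budgets is a regression
problem, while predicting whether a customer defaults is a classification
problem.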

Supervised Learning: Techniques

Linear Regression
Logistic Regression
Decision Trees
Support Vector Machines

Unsupervised Learning
A set of statistical tools intended for the setting in which we have only
a set of features X1, X2, ..., Xp measured on n observations.
We are not interested in prediction, because we do not have an
associated response variable Y.
The goal is to discover interesting things about the measurements on
X1, X2, ..., Xp.
Is there any informative way to visualize the data?
Can we discover subgroups among the variables or among the observations?

Unsupervised Learning: Challenges

The exercise tends to be more subjective, and there is no simple goal
for the analysis, such as prediction of a response.
Unsupervised learning is often performed as a part of an exploratory
data analysis.
It can be very hard to assess the results obtained from the
unsupervised learning methods since there is no universally accepted
mechanism for validating results on an independent data set.

Unsupervised Learning: Techniques

1. Principal Component Analysis: a tool used for data visualization or
data pre-processing before supervised techniques are applied.
2. Clustering: a broad class of methods for discovering unknown
subgroups in data.
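
As a rough illustration of the first technique, the sketch below runs
principal component analysis with base R's prcomp(). The built-in USArrests
data set is an assumed example, chosen only because it ships with R.

```r
# PCA on R's built-in USArrests data (an illustrative choice)
data(USArrests)

# scale. = TRUE standardizes each variable before extracting components
pca <- prcomp(USArrests, scale. = TRUE)

# Proportion of variance explained by each principal component
pve <- pca$sdev^2 / sum(pca$sdev^2)
print(round(pve, 3))

# Visualization: observations and variable loadings on the first two components
biplot(pca, scale = 0)

# Pre-processing: the first two component scores can serve as inputs
# to a later supervised model
scores <- pca$x[, 1:2]
print(head(scores))
```

Standardizing first is a design choice: without it, variables measured on
larger scales would dominate the leading components.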

Objectives of This Course

Learning a number of supervised and unsupervised learning techniques.
Implementation of these techniques in R.

R: Some of the Best Features

Open Source Software, available on every major platform.
Massive set of packages for visualization, statistical modelling,
machine learning, and importing and manipulating data.
Readily available tools for data analysis.
A great community. Easy to get help from experts.
Readily available tools for communicating results.
Can connect to high-performance programming languages such as C and C++.
(See Advanced R by Hadley Wickham for more details.)