
Analytics Boot Camp Course
Aditya Joshi
Kumar Rishabh
Srinivas Nv Gannavarapu
"The science of examining raw data with the purpose of drawing conclusions about that information" - TechTarget
Data analytics refers to qualitative and
quantitative techniques and processes used
to enhance productivity and business gain.
Data is extracted and categorized to identify
and analyze behavioral data and patterns,
and techniques vary according to
organizational requirements. - Techopedia
Brainstorm
Where else is Data Analytics used?
Workflow
• Planning, organizing and requirement gathering
• Gathering Data
• Data Cleaning
• Analyzing Data, Predictive Modelling and Result Generation
• Result Presentation
Learning Process

Obtain Data → Extract Features → Train Model


Learning Process

New Data → Features → Model → Result


Evaluation
• How is data obtained?
• Train-test data
• Evaluation methodologies
• Precision, Recall, F-Measure
• Other methods of evaluation: the confusion matrix
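As a small sketch of how precision, recall, and F-measure are computed from true/false positives (the function name and the toy labels below are our own illustration, not from the course material):

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F-measure for a binary labeling."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# toy labels: 2 true positives, 1 false positive, 1 false negative
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
p, r, f = precision_recall_f1(y_true, y_pred)   # each comes out to 2/3
```

These three numbers, together with the confusion matrix they are derived from, are what the evaluation step of the workflow reports.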
Data Attribute Types
Nominal • Categorical • Ordinal • Continuous


Tools We Use
R Programming Language
Pros
• Best suited for data-oriented problems
• Strong package ecosystem
• Graphics and charting
• Simple learning curve
Cons
• Memory management, speed, and efficiency
• Not a complete programming language
• Not aimed at advanced programmers
Matlab
Pros
• Contains a lot of advanced toolboxes
• Good documentation
• Good customer support
Cons
• Expensive
• Not as much open source code available, because Matlab requires a license
• Cannot integrate your code into a web service
Python
Pros
• Great data analysis libraries (pandas, statsmodels)
• Code can be easily integrated into a web service
• Simple and easy to begin
Cons
• A lot of cutting-edge, advanced academic research is still being done in R/Matlab
• Advanced features of the language might present a steep learning curve for newcomers
SAS
Pros
• Easy to learn
• Dedicated customer service along with the community
• Simple learning curve
Cons
• Expensive enterprise tool
• Limited functionality
Julia
Pros
• High performance and efficiency
• Good for writing computationally intensive programs that use multiple CPUs
• Decent visualization capabilities
Cons
• Relatively new; growing community
• Not syntactically optimized for statistical operations on data arrays
Brainstorm
What’s the Best Language?
The Big Blue
Cloud Computing, Bluemix
and Analytics
SaaS – PaaS – IaaS
IBM Bluemix
Platform as a Service
PaaS Offerings
• Zero infrastructure, lower risk
• Lower cost and improved profitability
• Easy and quick development; monetize quickly
• Reusable code and business logic
• Integration with other web services
Examples: Google App Engine, Heroku, IBM Bluemix, Openshift, Cloud Foundry, …
https://www.zoho.com/creator/paas.html
Bluemix Offerings

Storage • Analytics • Watson • Mobile • IoT • Containers • and much more
IBM Bluemix Data & Analytics
• Data Storage: Cloudant NoSQL DB, Redis, IBM DashDB
• Graph Processing: IBM Graph
• Number Crunching: IBM Analytics for Apache Spark
Why Bluemix?

Cloud Service

Ready to Use Platform

Open Source Tools

Because it’s IBM


DashDB
Let’s Do It
Getting Started
Setup and Basics
Basics of R Programming
Learning further with swirl
Need a Break?
Overview of Machine Learning
What is Machine Learning?
A computer program is said to learn from
experience E with respect to some class of
tasks T and performance measure P if its
performance at tasks in T, as measured by
P, improves with experience E

- Tom Mitchell
Supervised Learning - Regression
Linear Regression
Modeling the relationship between a scalar dependent variable and one or more explanatory variables (or independent variables).
If we have only one independent variable, the model is called simple linear regression; otherwise, multiple linear regression.
[Scatter plot: Salary in the First Job (dependent variable) vs. Grade Point Average / Average Marks (independent variable)]
Linear Regression
Goal: Find the line such that the distance from the line to each point is minimized.
We "fit" the points with a line so that an "objective function" is minimized. The line we thus obtain minimizes the sum of squared residuals (least squares):
Σᵢ (predictedᵢ - actualᵢ)² = Σᵢ (residualᵢ)²
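The least-squares fit for one independent variable has a closed form, sketched below (the GPA/salary numbers are made up for illustration, echoing the earlier plot):

```python
def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x, minimizing the sum of squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope: covariance of x and y divided by variance of x
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
    a = mean_y - b * mean_x      # intercept passes through the means
    return a, b

# hypothetical GPA vs. first-job salary data (perfectly linear on purpose)
gpa = [6.0, 7.0, 8.0, 9.0]
salary = [30, 40, 50, 60]
a, b = fit_line(gpa, salary)     # a = -30.0, b = 10.0
```

With a perfectly linear toy dataset the residuals are all zero; on real data the fitted line only minimizes, not eliminates, them.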
Logistic Regression
A regression model where the dependent variable (DV) is categorical.
Logistic regression is technically a classification technique - do not be confused by the word "Regression".
[Scatter plot: "Got a job" vs. "Didn't get a job", plotted against Assessment Score in Programming Skills and Grade Point Average / Average Marks]

Logistic Regression
Goal: Find the parameters that best fit the data.
We "fit" the points so that an "objective function" is optimized; for logistic regression, the parameters are typically chosen to maximize the likelihood of the observed data, rather than by least squares.
The logistic function:
Ŷᵢ = eᵘ / (1 + eᵘ)
where Ŷᵢ (Y-hat) is the estimated probability that the i-th case is in a category, and u is the regular linear regression equation:
u = A + B₁X₁ + B₂X₂ + … + B_K X_K
Supervised Learning - Classification
Nearest Neighbor Approaches
Find k closest training examples, and poll their class values
[Scatter plot: "Got a job" vs. "Didn't get a job", plotted against Assessment Score in Programming Skills and Grade Point Average / Average Marks]
K Nearest Neighbors (k-NN)
k-NN is a type of instance-based learning, or lazy learning, where the
function is only approximated locally and all computation is deferred
until classification.

One of the simplest machine learning algorithms.
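The "find the k closest examples and poll their class values" idea fits in a few lines (a minimal sketch; the GPA/score training points are hypothetical, echoing the slide's job example):

```python
from collections import Counter

def knn_classify(train, query, k=3):
    """Classify `query` by polling the class values of its k closest
    training examples (Euclidean distance in 2-D)."""
    def dist(p, q):
        return ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5
    # sort all training points by distance to the query, keep the k nearest
    neighbors = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# toy data: (GPA, assessment score) -> outcome
train = [((9.0, 85), "job"), ((8.5, 80), "job"), ((8.8, 90), "job"),
         ((6.0, 40), "no job"), ((5.5, 35), "no job")]
label = knn_classify(train, (8.9, 88), k=3)   # "job"
```

Notice there is no training step at all: the full dataset is consulted at query time, which is exactly the "lazy learning" trade-off listed below.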


K Nearest Neighbors (k-NN)
Pros
• Requires no training
• Easy to understand
• Easy for active learning processes
Cons
• Doesn't scale up very well for large training sets (requires special implementations like KD-trees)
• 'K' is hard to determine
• Requires the full dataset in memory
Decision Trees
Find a model for class attribute as a function of the values of other attributes.
[Scatter plot: "Got a job" vs. "Didn't get a job", plotted against Assessment Score in Programming Skills and Grade Point Average / Average Marks; the space is split into rectangular decision regions]
Decision Trees
Goal: Build a tree; at each node, split the data on the basis of the one attribute that provides the best split.
• If Dt contains records that all belong to the same class yt, then t is a leaf node labeled yt
• If Dt is an empty set, then t is a leaf node labeled with the default class, yd
• If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets, and recursively apply the procedure to each subset
Decision Trees
• Determine how to split the records
• How to specify the attribute test condition?
• How to determine the best split?
• Determine when to stop splitting
Decision Trees
When a node p is split into k partitions (children), the quality of the split is computed as
GINI_split = Σᵢ₌₁ᵏ (nᵢ / n) · GINI(i)
where nᵢ = number of records at child i, and n = number of records at node p.
An alternate method uses Information Gain and Entropy.
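The weighted Gini formula above can be sketched directly (function names are our own; GINI(i) here is the standard 1 − Σ pⱼ² impurity of child i):

```python
def gini(labels):
    """Gini impurity of one node: 1 - sum of squared class proportions."""
    n = len(labels)
    counts = {}
    for lab in labels:
        counts[lab] = counts.get(lab, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def gini_split(children):
    """Quality of a split into k children: sum of (n_i / n) * GINI(i)."""
    n = sum(len(c) for c in children)
    return sum(len(c) / n * gini(c) for c in children)

pure = gini_split([["yes", "yes"], ["no", "no"]])    # 0.0: a perfect split
mixed = gini_split([["yes", "no"], ["yes", "no"]])   # 0.5: an uninformative split
```

A tree builder evaluates this score for each candidate attribute and splits on the one with the lowest weighted impurity.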


Decision Trees – Travel Time to Office
[Decision tree diagram]
Leave At?
• 8 AM → Short
• 9 AM → Accident? No → Medium; Yes → Long
• 10 AM → Stall? No → Medium; Yes → Long
Decision Trees – Travel Time to Office
if hour == 8am Leave At
commute time = Short 10 AM 9 AM
else if hour == 9am
if accident == yes 8 AM
Stall? Accident?
commute time = long
else Short
No Yes
commute time = medium No Yes
else if hour == 10am
if stall == yes Medium Long Medium Long
commute time = long
else
commute time = medium
Decision Trees
Pros
• Inexpensive to construct
• Extremely fast at classifying unknown records
• Easy to interpret for small-sized trees
Cons
• Time for building a tree may be higher than for another type of classifier
• Error propagates as the number of classes increases
Random Forests
An ensemble classifier containing many decision trees; it outputs the class that is the mode of the classes output by the individual trees.
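The "mode of the individual trees' outputs" voting step can be sketched on its own (the three stand-in "trees" below are hypothetical decision rules for illustration, not a trained forest):

```python
from collections import Counter

def forest_predict(trees, x):
    """Combine an ensemble's predictions: output the mode (majority class)
    of the classes output by the individual trees."""
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]

# stand-in "trees": each is a callable mapping a record to a class
trees = [lambda x: "long" if x["accident"] else "medium",
         lambda x: "long" if x["stall"] else "medium",
         lambda x: "medium"]
pred = forest_predict(trees, {"accident": True, "stall": False})   # "medium", 2 of 3 votes
```

In a real random forest each tree is trained on a bootstrap sample with a random subset of features, which is what makes the individual votes diverse.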
Random Forests
Pros
• Runs efficiently on large databases
• Can handle thousands of input variables without variable deletion
• Has methods for balancing error in class-population-unbalanced data sets
Cons
• Has been observed to overfit on noisy datasets
• For data including categorical variables with different numbers of levels, random forests are biased in favor of attributes with more levels
Support Vector Machines
SVM
[Scatter plot: "Got a job" vs. "Didn't get a job", plotted against Assessment Score in Programming Skills and Grade Point Average / Average Marks; of several candidate separating lines, the one with the widest margin is chosen]
Support Vector Machines
• Plus-plane = { x : w · x + b = +1 }
• Minus-plane = { x : w · x + b = -1 }
Classify as +1 if w · x + b ≥ 1, and as -1 if w · x + b ≤ -1.
What we know:
• w · x⁺ + b = +1
• w · x⁻ + b = -1
• x⁺ = x⁻ + λw
• |x⁺ - x⁻| = M
Subtracting the plane equations gives w · (x⁺ - x⁻) = 2, so λ (w · w) = 2 and λ = 2 / (w · w). Therefore:
M = Margin Width = |x⁺ - x⁻| = |λw| = λ √(w · w) = 2 / √(w · w)
Slide by Andrew W. Moore, CMU
Maximize  Σₖ₌₁ᴿ αₖ - (1/2) Σₖ₌₁ᴿ Σₗ₌₁ᴿ αₖ αₗ Q_kl   where Q_kl = yₖ yₗ (xₖ · xₗ)
Subject to these constraints:
0 ≤ αₖ ≤ C for all k,   and   Σₖ₌₁ᴿ αₖ yₖ = 0
Then define:
w = Σₖ₌₁ᴿ αₖ yₖ xₖ
b = y_K (1 - ε_K) - x_K · w   where K = argmaxₖ αₖ
Then classify with:
f(x, w, b) = sign(w · x - b)
Slide by Andrew W. Moore, CMU


Support Vector Machines - Kernels
Let's consider data points in only one dimension for simplicity.
Support Vector Machines - Kernels
The linear classifier relies on the dot product between vectors: K(xᵢ, xⱼ) = xᵢᵀxⱼ.
If every data point is mapped into a high-dimensional space via some transformation Φ: x → φ(x), the dot product becomes K(xᵢ, xⱼ) = φ(xᵢ)ᵀφ(xⱼ).
A kernel function is a function that corresponds to an inner product in some expanded feature space.
Example:
For 2-dimensional vectors x = [x₁ x₂], let K(xᵢ, xⱼ) = (1 + xᵢᵀxⱼ)².
We need to show that K(xᵢ, xⱼ) = φ(xᵢ)ᵀφ(xⱼ):
K(xᵢ, xⱼ) = (1 + xᵢᵀxⱼ)²
= 1 + xᵢ₁²xⱼ₁² + 2 xᵢ₁xⱼ₁xᵢ₂xⱼ₂ + xᵢ₂²xⱼ₂² + 2 xᵢ₁xⱼ₁ + 2 xᵢ₂xⱼ₂
= [1, xᵢ₁², √2 xᵢ₁xᵢ₂, xᵢ₂², √2 xᵢ₁, √2 xᵢ₂]ᵀ [1, xⱼ₁², √2 xⱼ₁xⱼ₂, xⱼ₂², √2 xⱼ₁, √2 xⱼ₂]
= φ(xᵢ)ᵀφ(xⱼ),   where φ(x) = [1, x₁², √2 x₁x₂, x₂², √2 x₁, √2 x₂]
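This kernel identity can be checked numerically (a small sketch; the two point values are arbitrary):

```python
import math

def poly_kernel(xi, xj):
    """K(xi, xj) = (1 + xi . xj)^2 for 2-dimensional xi, xj."""
    return (1 + xi[0] * xj[0] + xi[1] * xj[1]) ** 2

def phi(x):
    """Explicit feature map phi(x) = [1, x1^2, sqrt(2) x1 x2, x2^2, sqrt(2) x1, sqrt(2) x2]."""
    r2 = math.sqrt(2)
    return [1, x[0] ** 2, r2 * x[0] * x[1], x[1] ** 2, r2 * x[0], r2 * x[1]]

xi, xj = (1.5, -2.0), (0.5, 3.0)
lhs = poly_kernel(xi, xj)                          # kernel in the original 2-D space
rhs = sum(a * b for a, b in zip(phi(xi), phi(xj)))  # dot product in the 6-D feature space
# lhs == rhs up to floating-point rounding
```

The point of the "kernel trick" is that the left-hand side costs a 2-D dot product, while the right-hand side would require explicitly building the 6-D vectors.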
[Figure: the 1D data, mapped into a higher-dimensional (3D) feature space, becomes linearly separable]
Support Vector Machines
Pros
• Less prone to overfitting
• Needs less memory to store the predictive model
• Reaches the global optimum, due to quadratic programming
• Works well with smaller-sized datasets
Cons
• Hyperparameter search is important, and complex
• Deep neural networks are performing better than previous SVM-based state-of-the-art solutions
Naïve Bayes
Apply Bayes’ theorem with the “naive” assumption of independence between
every pair of features
Naïve Bayes
Before the evidence is obtained: prior probability
• P(a) is the prior probability that the proposition is true
• P(cavity) = 0.1
After the evidence is obtained: posterior probability
• P(a|b)
• The probability of a, given that all we know is b
• P(cavity|toothache) = 0.8
Naïve Bayes
Bayes' Theorem (Thomas Bayes, 1763):
P(b | a) = P(a | b) P(b) / P(a)
Based on Bayes' Theorem, we can compute the Maximum A Posteriori (MAP) hypothesis for the data.
We are interested in the best hypothesis from some space H, given observed training data D.
Naïve Bayes
h_MAP = argmax_{h∈H} P(h | D)
      = argmax_{h∈H} P(D | h) P(h) / P(D)
      = argmax_{h∈H} P(D | h) P(h)
We can drop P(D) because it is independent of the hypothesis.
Naïve Bayes
Training set: instances of different classes cⱼ, described as conjunctions of attribute values.
Classify a new instance d, based on its attribute values, into one of the classes cⱼ ∈ C.
Key idea: assign the most probable class c_MAP using Bayes' Theorem:
c_MAP = argmax_{cⱼ∈C} P(cⱼ | x₁, x₂, …, xₙ)
      = argmax_{cⱼ∈C} P(x₁, x₂, …, xₙ | cⱼ) P(cⱼ) / P(x₁, x₂, …, xₙ)
      = argmax_{cⱼ∈C} P(x₁, x₂, …, xₙ | cⱼ) P(cⱼ)
Naïve Bayes
Pros
• Performs at state-of-the-art level for some use cases
• Performs well on small datasets
• Converges quickly
Cons
• Not suitable for very large datasets
• Performs poorly if features are correlated
Brainstorm
Which classifier should I use?
Questions
• What is the relative importance of each predictor?
• How does each variable affect the outcome?
• Does a predictor make the solution better or worse or have no effect?
Questions
• Can parameters be accurately predicted?
• How good is the model at classifying cases for which the outcome is known?
Questions
• What is the prediction equation in the presence of covariates?
• Can prediction models be tested for relative fit to the data?
• So-called "goodness of fit" statistics
• What is the strength of association between the outcome variable and a
set of predictors?
• Often in model comparison you want non-significant differences so strength of
association is reported for even non-significant effects.
Unsupervised Learning
Clustering
Draw inferences from datasets consisting of input data without labeled responses.
Clustering is used for exploratory data analysis to find hidden patterns or grouping
in data
Where is Clustering Used?
Marketing: segment customer behaviors
Banking: fraud detection
Gene Analysis: identify gene responsible for a disease
Image Processing: identifying objects in an image (e.g. face recognition)
Insurance: identify policy holders with high average claim cost
K-Means Algorithm
1. Fix a number k = number of required clusters

K=3
K-Means Algorithm
2. Select K random points in the given space

K=3
K-Means Algorithm
3. For each point, find distance from the ‘k’ centroids, and assign it to the closest centroid
K-Means Algorithm
4. Within each newly formed cluster, find the new centroid
K-Means Algorithm
5. Repeat the process till the centroids are stable (converge)
K-Means Algorithm
6. Obtain final clusters
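Steps 1 through 6 fit in a short function (a minimal sketch in pure Python; the point coordinates are an arbitrary toy example with two obvious groups):

```python
import random

def kmeans(points, k, iterations=100, seed=0):
    """Plain k-means on 2-D points: assign each point to its closest
    centroid, recompute centroids, and repeat until they are stable."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)                 # step 2: k random points
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:                              # step 3: closest centroid
            d = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[d.index(min(d))].append(p)
        new = [(sum(x for x, _ in cl) / len(cl), sum(y for _, y in cl) / len(cl))
               if cl else centroids[i]
               for i, cl in enumerate(clusters)]      # step 4: new centroids
        if new == centroids:                          # step 5: converged
            break
        centroids = new
    return centroids, clusters                        # step 6: final clusters

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(pts, k=2)   # two clusters of three points each
```

Note that k-means only guarantees convergence to a local optimum; different random initializations (step 2) can yield different clusterings, which is why k-means is often run several times.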
Hands On
Tasks For Today
1. Set up Project
2. Loading Data – What data is like
3. Understanding Train-Test split
4. Summarizing the Data
5. Extracting Features
6. Training Model
7. Finding Result for unseen test data
8. Evaluation
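Step 3 above, the train-test split, can be sketched as follows (the helper name and the 75/25 ratio are our own illustration, not part of the course material):

```python
import random

def train_test_split(data, test_fraction=0.25, seed=42):
    """Shuffle the data, then hold out `test_fraction` of it for evaluation.
    The model must never see the test rows during training."""
    rng = random.Random(seed)
    shuffled = data[:]          # copy, so the caller's list is untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

rows = list(range(100))         # stand-in for 100 data records
train, test = train_test_split(rows)
# len(train) == 75, len(test) == 25, with no record in both halves
```

Keeping the held-out rows completely separate is what makes the evaluation in step 8 an honest estimate of performance on unseen data.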
Hands On Session
What’s Next?
