
Analytics Boot Camp Course
Aditya Joshi
Kumar Rishabh
Srinivas Nv Gannavarapu
"The science of examining raw data with the purpose of drawing conclusions about that information" - TechTarget
Data analytics refers to qualitative and
quantitative techniques and processes used
to enhance productivity and business gain.
Data is extracted and categorized to identify
and analyze behavioral data and patterns,
and techniques vary according to
organizational requirements. - Techopedia
Brainstorm
Where else is Data Analytics used?
Workflow
• Planning, organizing and requirement gathering
• Gathering Data
• Data Cleaning
• Analyzing Data, Predictive Modelling and Result Generation
• Result Presentation
Learning Process

Obtain Data → Extract Features → Train Model


Learning Process

New Data → Features → Model → Result


Evaluation
• How is data obtained?
• Train-test data
• Evaluation methodologies
• Precision, Recall, F-Measure
• Other methods of evaluation: the confusion matrix
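As a small sketch of how precision, recall, and F-measure are computed from true/false positives (the function name and the toy labels below are our own illustration, not from the course material):

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F-measure for a binary labeling."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# toy labels: 2 true positives, 1 false positive, 1 false negative
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
p, r, f = precision_recall_f1(y_true, y_pred)   # each comes out to 2/3
```

These three numbers, together with the confusion matrix they are derived from, are what the evaluation step of the workflow reports.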
Data Attribute Types
Nominal • Categorical • Ordinal • Continuous


Tools We Use
R Programming Language
Pros
• Best suited for data-oriented problems
• Strong package ecosystem
• Graphics and charting
• Simple learning curve
Cons
• Memory management, speed, and efficiency
• Not a complete programming language
• Not aimed at advanced programmers
Matlab
Pros
• Contains a lot of advanced toolboxes
• Good documentation
• Good customer support
Cons
• Expensive
• Not as much open source code available, because Matlab requires a license
• Cannot integrate your code into a web service
Python
Pros
• Great data analysis libraries (pandas, statsmodels)
• Code can be easily integrated into a web service
• Simple and easy to begin
Cons
• A lot of cutting-edge, advanced academic research is still being done in R/Matlab
• Advanced features of the language might present a steep learning curve for newcomers
SAS
Pros
• Easy to learn
• Dedicated customer service along with the community
• Simple learning curve
Cons
• Expensive enterprise tool
• Limited functionality
Julia
Pros
• High performance and efficiency
• Good for writing computationally intensive programs that use multiple CPUs
• Decent visualization capabilities
Cons
• Relatively new; growing community
• Not syntactically optimized for statistical operations on data arrays
Brainstorm
What’s the Best Language?
The Big Blue
Cloud Computing, Bluemix
and Analytics
SaaS – PaaS – IaaS
IBM Bluemix
Platform as a Service
PaaS Offerings
• Zero infrastructure, lower risk
• Lower cost and improved profitability
• Easy and quick development; monetize quickly
• Reusable code and business logic
• Integration with other web services
Examples: Google App Engine, Heroku, IBM Bluemix, Openshift, Cloud Foundry, …
https://www.zoho.com/creator/paas.html
Bluemix Offerings

Storage • Analytics • Watson • Mobile • IoT • Containers • and much more
IBM Bluemix Data & Analytics
• Data Storage: Cloudant NoSQL DB, Redis, IBM DashDB
• Graph Processing: IBM Graph
• Number Crunching: IBM Analytics for Apache Spark
Why Bluemix?

Cloud Service

Ready to Use Platform

Open Source Tools

Because it’s IBM


DashDB
Let’s Do It
Getting Started
Setup and Basics
Basics of R Programming
Learning further with swirl
Need a Break?
Overview of Machine Learning
What is Machine Learning?
A computer program is said to learn from
experience E with respect to some class of
tasks T and performance measure P if its
performance at tasks in T, as measured by
P, improves with experience E

- Tom Mitchell
Supervised Learning - Regression
Linear Regression
Modeling the relationship between a scalar dependent variable and one or more explanatory variables (or independent variables).
If we have only one independent variable, the model is called simple linear regression; otherwise, multiple linear regression.
[Scatter plot: Salary in the First Job (dependent variable) vs. Grade Point Average / Average Marks (independent variable)]
Linear Regression
Goal: Find the line such that the distance from the line to each point is minimized.
We "fit" the points with a line so that an "objective function" is minimized. The line we thus obtain minimizes the sum of squared residuals (least squares):
Σᵢ (predictedᵢ - actualᵢ)² = Σᵢ (residualᵢ)²
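The least-squares fit for one independent variable has a closed form, sketched below (the GPA/salary numbers are made up for illustration, echoing the earlier plot):

```python
def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x, minimizing the sum of squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope: covariance of x and y divided by variance of x
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
    a = mean_y - b * mean_x      # intercept passes through the means
    return a, b

# hypothetical GPA vs. first-job salary data (perfectly linear on purpose)
gpa = [6.0, 7.0, 8.0, 9.0]
salary = [30, 40, 50, 60]
a, b = fit_line(gpa, salary)     # a = -30.0, b = 10.0
```

With a perfectly linear toy dataset the residuals are all zero; on real data the fitted line only minimizes, not eliminates, them.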
Logistic Regression
A regression model where the dependent variable (DV) is categorical.
Logistic regression is technically a classification technique - do not be confused by the word "Regression".
[Scatter plot: "Got a job" vs. "Didn't get a job", plotted against Assessment Score in Programming Skills and Grade Point Average / Average Marks]

Logistic Regression
Goal: Find the parameters that best fit the data.
We "fit" the points so that an "objective function" is optimized; for logistic regression, the parameters are typically chosen to maximize the likelihood of the observed data, rather than by least squares.
The logistic function:
Ŷᵢ = eᵘ / (1 + eᵘ)
where Ŷᵢ (Y-hat) is the estimated probability that the i-th case is in a category, and u is the regular linear regression equation:
u = A + B₁X₁ + B₂X₂ + … + B_K X_K
Supervised Learning - Classification
Nearest Neighbor Approaches
Find k closest training examples, and poll their class values
[Scatter plot: "Got a job" vs. "Didn't get a job", plotted against Assessment Score in Programming Skills and Grade Point Average / Average Marks]
K Nearest Neighbors (k-NN)
k-NN is a type of instance-based learning, or lazy learning, where the
function is only approximated locally and all computation is deferred
until classification.

One of the simplest machine learning algorithms.
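The "find the k closest examples and poll their class values" idea fits in a few lines (a minimal sketch; the GPA/score training points are hypothetical, echoing the slide's job example):

```python
from collections import Counter

def knn_classify(train, query, k=3):
    """Classify `query` by polling the class values of its k closest
    training examples (Euclidean distance in 2-D)."""
    def dist(p, q):
        return ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5
    # sort all training points by distance to the query, keep the k nearest
    neighbors = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# toy data: (GPA, assessment score) -> outcome
train = [((9.0, 85), "job"), ((8.5, 80), "job"), ((8.8, 90), "job"),
         ((6.0, 40), "no job"), ((5.5, 35), "no job")]
label = knn_classify(train, (8.9, 88), k=3)   # "job"
```

Notice there is no training step at all: the full dataset is consulted at query time, which is exactly the "lazy learning" trade-off listed below.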


K Nearest Neighbors (k-NN)
Pros
• Requires no training
• Easy to understand
• Easy for active learning processes
Cons
• Doesn't scale up very well for large training sets (requires special implementations like KD-trees)
• 'K' is hard to determine
• Requires the full dataset in memory
Decision Trees
Find a model for class attribute as a function of the values of other attributes.
[Scatter plot: "Got a job" vs. "Didn't get a job", plotted against Assessment Score in Programming Skills and Grade Point Average / Average Marks; the space is split into rectangular decision regions]
Decision Trees
Goal: Build a tree; at each node, split the data on the basis of the one attribute that provides the best split.
• If Dt contains records that all belong to the same class yt, then t is a leaf node labeled yt
• If Dt is an empty set, then t is a leaf node labeled with the default class, yd
• If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets, and recursively apply the procedure to each subset
Decision Trees
• Determine how to split the records
• How to specify the attribute test condition?
• How to determine the best split?
• Determine when to stop splitting
Decision Trees
When a node p is split into k partitions (children), the quality of the split is computed as
GINI_split = Σᵢ₌₁ᵏ (nᵢ / n) · GINI(i)
where nᵢ = number of records at child i, and n = number of records at node p.
An alternate method uses Information Gain and Entropy.
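The weighted Gini formula above can be sketched directly (function names are our own; GINI(i) here is the standard 1 − Σ pⱼ² impurity of child i):

```python
def gini(labels):
    """Gini impurity of one node: 1 - sum of squared class proportions."""
    n = len(labels)
    counts = {}
    for lab in labels:
        counts[lab] = counts.get(lab, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def gini_split(children):
    """Quality of a split into k children: sum of (n_i / n) * GINI(i)."""
    n = sum(len(c) for c in children)
    return sum(len(c) / n * gini(c) for c in children)

pure = gini_split([["yes", "yes"], ["no", "no"]])    # 0.0: a perfect split
mixed = gini_split([["yes", "no"], ["yes", "no"]])   # 0.5: an uninformative split
```

A tree builder evaluates this score for each candidate attribute and splits on the one with the lowest weighted impurity.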


Decision Trees – Travel Time to Office
[Decision tree diagram]
Leave At?
• 8 AM → Short
• 9 AM → Accident? No → Medium; Yes → Long
• 10 AM → Stall? No → Medium; Yes → Long
Decision Trees – Travel Time to Office
if hour == 8am Leave At
commute time = Short 10 AM 9 AM
else if hour == 9am
if accident == yes 8 AM
Stall? Accident?
commute time = long
else Short
No Yes
commute time = medium No Yes
else if hour == 10am
if stall == yes Medium Long Medium Long
commute time = long
else
commute time = medium
Decision Trees
Pros
• Inexpensive to construct
• Extremely fast at classifying unknown records
• Easy to interpret for small-sized trees
Cons
• Time for building a tree may be higher than for another type of classifier
• Error propagates as the number of classes increases
Random Forests
An ensemble classifier containing many decision trees; it outputs the class that is the mode of the classes output by the individual trees.
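The "mode of the individual trees' outputs" voting step can be sketched on its own (the three stand-in "trees" below are hypothetical decision rules for illustration, not a trained forest):

```python
from collections import Counter

def forest_predict(trees, x):
    """Combine an ensemble's predictions: output the mode (majority class)
    of the classes output by the individual trees."""
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]

# stand-in "trees": each is a callable mapping a record to a class
trees = [lambda x: "long" if x["accident"] else "medium",
         lambda x: "long" if x["stall"] else "medium",
         lambda x: "medium"]
pred = forest_predict(trees, {"accident": True, "stall": False})   # "medium", 2 of 3 votes
```

In a real random forest each tree is trained on a bootstrap sample with a random subset of features, which is what makes the individual votes diverse.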
Random Forests
Pros
• Runs efficiently on large databases
• Can handle thousands of input variables without variable deletion
• Has methods for balancing error in class-population-unbalanced data sets
Cons
• Has been observed to overfit on noisy datasets
• For data including categorical variables with different numbers of levels, random forests are biased in favor of attributes with more levels
Support Vector Machines
SVM
[Scatter plot: "Got a job" vs. "Didn't get a job", plotted against Assessment Score in Programming Skills and Grade Point Average / Average Marks; of several candidate separating lines, the one with the widest margin is chosen]
Support Vector Machines
• Plus-plane = { x : w · x + b = +1 }
• Minus-plane = { x : w · x + b = -1 }
Classify as +1 if w · x + b ≥ 1, and as -1 if w · x + b ≤ -1.
What we know:
• w · x⁺ + b = +1
• w · x⁻ + b = -1
• x⁺ = x⁻ + λw
• |x⁺ - x⁻| = M
Subtracting the plane equations gives w · (x⁺ - x⁻) = 2, so λ (w · w) = 2 and λ = 2 / (w · w). Therefore:
M = Margin Width = |x⁺ - x⁻| = |λw| = λ √(w · w) = 2 / √(w · w)
Slide by Andrew W. Moore, CMU
Maximize  Σₖ₌₁ᴿ αₖ - (1/2) Σₖ₌₁ᴿ Σₗ₌₁ᴿ αₖ αₗ Q_kl   where Q_kl = yₖ yₗ (xₖ · xₗ)
Subject to these constraints:
0 ≤ αₖ ≤ C for all k,   and   Σₖ₌₁ᴿ αₖ yₖ = 0
Then define:
w = Σₖ₌₁ᴿ αₖ yₖ xₖ
b = y_K (1 - ε_K) - x_K · w   where K = argmaxₖ αₖ
Then classify with:
f(x, w, b) = sign(w · x - b)
Slide by Andrew W. Moore, CMU


Support Vector Machines - Kernels
Let's consider data points in only one dimension for simplicity.
Support Vector Machines - Kernels
The linear classifier relies on the dot product between vectors: K(xᵢ, xⱼ) = xᵢᵀxⱼ.
If every data point is mapped into a high-dimensional space via some transformation Φ: x → φ(x), the dot product becomes K(xᵢ, xⱼ) = φ(xᵢ)ᵀφ(xⱼ).
A kernel function is a function that corresponds to an inner product in some expanded feature space.
Example:
For 2-dimensional vectors x = [x₁ x₂], let K(xᵢ, xⱼ) = (1 + xᵢᵀxⱼ)².
We need to show that K(xᵢ, xⱼ) = φ(xᵢ)ᵀφ(xⱼ):
K(xᵢ, xⱼ) = (1 + xᵢᵀxⱼ)²
= 1 + xᵢ₁²xⱼ₁² + 2 xᵢ₁xⱼ₁xᵢ₂xⱼ₂ + xᵢ₂²xⱼ₂² + 2 xᵢ₁xⱼ₁ + 2 xᵢ₂xⱼ₂
= [1, xᵢ₁², √2 xᵢ₁xᵢ₂, xᵢ₂², √2 xᵢ₁, √2 xᵢ₂]ᵀ [1, xⱼ₁², √2 xⱼ₁xⱼ₂, xⱼ₂², √2 xⱼ₁, √2 xⱼ₂]
= φ(xᵢ)ᵀφ(xⱼ),   where φ(x) = [1, x₁², √2 x₁x₂, x₂², √2 x₁, √2 x₂]
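This kernel identity can be checked numerically (a small sketch; the two point values are arbitrary):

```python
import math

def poly_kernel(xi, xj):
    """K(xi, xj) = (1 + xi . xj)^2 for 2-dimensional xi, xj."""
    return (1 + xi[0] * xj[0] + xi[1] * xj[1]) ** 2

def phi(x):
    """Explicit feature map phi(x) = [1, x1^2, sqrt(2) x1 x2, x2^2, sqrt(2) x1, sqrt(2) x2]."""
    r2 = math.sqrt(2)
    return [1, x[0] ** 2, r2 * x[0] * x[1], x[1] ** 2, r2 * x[0], r2 * x[1]]

xi, xj = (1.5, -2.0), (0.5, 3.0)
lhs = poly_kernel(xi, xj)                          # kernel in the original 2-D space
rhs = sum(a * b for a, b in zip(phi(xi), phi(xj)))  # dot product in the 6-D feature space
# lhs == rhs up to floating-point rounding
```

The point of the "kernel trick" is that the left-hand side costs a 2-D dot product, while the right-hand side would require explicitly building the 6-D vectors.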
[Figure: the 1D data, mapped into a higher-dimensional (3D) feature space, becomes linearly separable]
Support Vector Machines
Pros
• Less prone to overfitting
• Needs less memory to store the predictive model
• Reaches the global optimum, due to quadratic programming
• Works well with smaller-sized datasets
Cons
• Hyperparameter search is important, and complex
• Deep neural networks are performing better than previous SVM-based state-of-the-art solutions
Naïve Bayes
Apply Bayes’ theorem with the “naive” assumption of independence between
every pair of features
Naïve Bayes
Before the evidence is obtained: prior probability
• P(a) is the prior probability that the proposition is true
• P(cavity) = 0.1
After the evidence is obtained: posterior probability
• P(a|b)
• The probability of a, given that all we know is b
• P(cavity|toothache) = 0.8
Naïve Bayes
Bayes' Theorem (Thomas Bayes, 1763):
P(b | a) = P(a | b) P(b) / P(a)
Based on Bayes' Theorem, we can compute the Maximum A Posteriori (MAP) hypothesis for the data.
We are interested in the best hypothesis from some space H, given observed training data D.
Naïve Bayes
h_MAP = argmax_{h∈H} P(h | D)
      = argmax_{h∈H} P(D | h) P(h) / P(D)
      = argmax_{h∈H} P(D | h) P(h)
We can drop P(D) because it is independent of the hypothesis.
Naïve Bayes
Training set: instances of different classes cⱼ, described as conjunctions of attribute values.
Classify a new instance d, based on its attribute values, into one of the classes cⱼ ∈ C.
Key idea: assign the most probable class c_MAP using Bayes' Theorem:
c_MAP = argmax_{cⱼ∈C} P(cⱼ | x₁, x₂, …, xₙ)
      = argmax_{cⱼ∈C} P(x₁, x₂, …, xₙ | cⱼ) P(cⱼ) / P(x₁, x₂, …, xₙ)
      = argmax_{cⱼ∈C} P(x₁, x₂, …, xₙ | cⱼ) P(cⱼ)
Naïve Bayes
Pros
• Performs at state-of-the-art level for some use cases
• Performs well on small datasets
• Converges quickly
Cons
• Not suitable for very large datasets
• Performs poorly if features are correlated
Brainstorm
Which classifier should I use?
Questions
• What is the relative importance of each predictor?
• How does each variable affect the outcome?
• Does a predictor make the solution better or worse or have no effect?
Questions
• Can parameters be accurately predicted?
• How good is the model at classifying cases for which the outcome is known?
Questions
• What is the prediction equation in the presence of covariates?
• Can prediction models be tested for relative fit to the data?
• So-called "goodness of fit" statistics
• What is the strength of association between the outcome variable and a
set of predictors?
• Often in model comparison you want non-significant differences so strength of
association is reported for even non-significant effects.
Unsupervised Learning
Clustering
Draw inferences from datasets consisting of input data without labeled responses.
Clustering is used for exploratory data analysis to find hidden patterns or grouping
in data
Where is Clustering Used?
Marketing: segment customer behaviors
Banking: fraud detection
Gene Analysis: identify gene responsible for a disease
Image Processing: identifying objects in an image (e.g. face recognition)
Insurance: identify policy holders with high average claim cost
K-Means Algorithm
1. Fix a number k = number of required clusters

K=3
K-Means Algorithm
2. Select K random points in the given space

K=3
K-Means Algorithm
3. For each point, find distance from the ‘k’ centroids, and assign it to the closest centroid
K-Means Algorithm
4. Within each newly formed cluster, find the new centroid
K-Means Algorithm
5. Repeat the process till the centroids are stable (converge)
K-Means Algorithm
6. Obtain final clusters
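Steps 1 through 6 fit in a short function (a minimal sketch in pure Python; the point coordinates are an arbitrary toy example with two obvious groups):

```python
import random

def kmeans(points, k, iterations=100, seed=0):
    """Plain k-means on 2-D points: assign each point to its closest
    centroid, recompute centroids, and repeat until they are stable."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)                 # step 2: k random points
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:                              # step 3: closest centroid
            d = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[d.index(min(d))].append(p)
        new = [(sum(x for x, _ in cl) / len(cl), sum(y for _, y in cl) / len(cl))
               if cl else centroids[i]
               for i, cl in enumerate(clusters)]      # step 4: new centroids
        if new == centroids:                          # step 5: converged
            break
        centroids = new
    return centroids, clusters                        # step 6: final clusters

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(pts, k=2)   # two clusters of three points each
```

Note that k-means only guarantees convergence to a local optimum; different random initializations (step 2) can yield different clusterings, which is why k-means is often run several times.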
Hands On
Tasks For Today
1. Set up Project
2. Loading Data – What data is like
3. Understanding Train-Test split
4. Summarizing the Data
5. Extracting Features
6. Training Model
7. Finding Result for unseen test data
8. Evaluation
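Step 3 above, the train-test split, can be sketched as follows (the helper name and the 75/25 ratio are our own illustration, not part of the course material):

```python
import random

def train_test_split(data, test_fraction=0.25, seed=42):
    """Shuffle the data, then hold out `test_fraction` of it for evaluation.
    The model must never see the test rows during training."""
    rng = random.Random(seed)
    shuffled = data[:]          # copy, so the caller's list is untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

rows = list(range(100))         # stand-in for 100 data records
train, test = train_test_split(rows)
# len(train) == 75, len(test) == 25, with no record in both halves
```

Keeping the held-out rows completely separate is what makes the evaluation in step 8 an honest estimate of performance on unseen data.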
Hands On Session
What’s Next?
