COURSE
Analytics Boot Camp Course
Aditya Joshi
Kumar Rishabh
Srinivas Nv Gannavarapu
the science of examining raw data
with the purpose of drawing
conclusions about that information
- Techtarget
Data analytics refers to qualitative and
quantitative techniques and processes used
to enhance productivity and business gain.
Data is extracted and categorized to identify
and analyze behavioral data and patterns,
and techniques vary according to
organizational requirements. - Techopedia
Brainstorm
Where else is Data Analytics used?
Workflow
• Planning, organizing and requirement gathering
• Gathering Data
• Data Cleaning
• Analyzing Data, Predictive Modelling and Result Generation
• Result Presentation
Learning Process
Cons
Matlab
Cons
Python
Cons
SAS
Cons
Julia
Cons
Brainstorm
What’s the Best Language?
The Big Blue
Cloud Computing, Bluemix
and Analytics
SaaS – PaaS – IaaS
IBM Bluemix
Platform as a Service
• Zero Infrastructure, Lower Risk
• Lower cost and improved profitability
• Easy and quick development, Monetize quickly
• Reusable code and business logic
• Integration with other web services
PaaS Offerings
• Google App Engine
• Heroku
• IBM Bluemix
• Openshift
• Cloud Foundry
• …
https://www.zoho.com/creator/paas.html
Bluemix Offerings
Mobile, IoT, Containers, and much more
IBM Bluemix Data & Analytics
Data Storage
• Cloudant NoSQL DB
• Redis
• IBM DashDB Cloud Service
A computer program is said to learn from
experience E with respect to some class of
tasks T and performance measure P if its
performance at tasks in T, as measured by
P, improves with experience E
- Tom Mitchell
Supervised Learning - Regression
Linear Regression
Linear regression models the relationship between a scalar dependent variable and one or more
explanatory variables (or independent variables).
With only one independent variable, the model is called simple linear
regression; otherwise, multiple linear regression.
[Figure: Salary in the First Job plotted against the independent variable]
Linear Regression
Goal: Find the line that minimizes the overall distance from the line to the points
(in ordinary least squares, the sum of squared vertical distances).
[Figure: Salary in the First Job vs. Assessment Score in Programming Skills]
$$u = A + B_1 X_1 + B_2 X_2 + \dots + B_K X_K$$
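As a rough illustration of fitting such a line, here is a minimal sketch of simple (one-variable) linear regression using the closed-form least-squares estimates. The function name and the score/salary numbers are made up for illustration and do not come from the course data.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (intercept, slope) minimizing the sum of squared vertical errors."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x); intercept follows from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data: assessment score -> first-job salary (arbitrary units).
scores = [1.0, 2.0, 3.0, 4.0, 5.0]
salaries = [3.0, 5.0, 7.0, 9.0, 11.0]  # constructed as 1 + 2 * score

a, b1 = fit_simple_linear_regression(scores, salaries)
predicted = a + b1 * 6.0  # prediction for an unseen score of 6
print(a, b1, predicted)
```

With several independent variables the same idea applies, but the coefficients are usually obtained with a linear-algebra routine rather than by hand.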
Supervised Learning - Classification
Nearest Neighbor Approaches
Find k closest training examples, and poll their class values
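The polling idea can be sketched in a few lines. This is only an illustrative sketch: `knn_predict`, the scores and the "yes"/"no" labels are hypothetical, and Euclidean distance is an assumed choice of metric.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Pair each training point's distance to the query with its label.
    distances = sorted(
        (math.dist(p, query), label)
        for p, label in zip(train_points, train_labels)
    )
    # Poll the class values of the k closest examples.
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Hypothetical data: assessment score -> "got a job" label.
points = [(2.0,), (3.0,), (4.0,), (7.0,), (8.0,), (9.0,)]
labels = ["no", "no", "no", "yes", "yes", "yes"]

high = knn_predict(points, labels, (7.5,), k=3)
low = knn_predict(points, labels, (3.5,), k=3)
print(high, low)
```

An odd k avoids tied votes when there are only two classes.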
[Figure: Got a job vs. Assessment Score in Programming Skills]
Cons
Decision Trees
Find a model for class attribute as a function of the values of other attributes.
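A decision tree is built by repeatedly choosing the attribute test that best separates the classes. The sketch below shows only that single building block, picking one threshold on one numeric attribute by Gini impurity; Gini is an assumed choice of criterion, and the data and function names are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 means pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_threshold(values, labels):
    """Return (threshold, weighted_gini) for the best split `value <= threshold`."""
    best = (None, float("inf"))
    ordered = sorted(set(values))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(ordered, ordered[1:]):
        threshold = (lo + hi) / 2
        left = [l for v, l in zip(values, labels) if v <= threshold]
        right = [l for v, l in zip(values, labels) if v > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (threshold, score)
    return best


# Hypothetical data: assessment score -> "got a job" label.
scores = [2.0, 3.0, 4.0, 7.0, 8.0, 9.0]
got_job = ["no", "no", "no", "yes", "yes", "yes"]

threshold, impurity = best_threshold(scores, got_job)
print(threshold, impurity)
```

A full tree would apply the same search recursively to each side of the split until the leaves are pure or a depth limit is reached.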
[Figure: Got a job vs. Assessment Score in Programming Skills]
[Figure: example decision tree with branches 8 AM, 9 AM and 10 AM, decision nodes "Stall?" and "Accident?" (No/Yes), and a leaf labeled "Short"]
Cons
Random Forests
An ensemble classifier that contains many decision trees and outputs the class that is the
mode of the classes output by the individual trees.
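The voting step can be sketched as follows. The three "trees" here are hand-written threshold rules standing in for trained decision trees, purely as an assumption for illustration; in an actual random forest each tree would be trained on a bootstrap sample of the data with a random subset of features.

```python
from collections import Counter


def forest_predict(trees, x):
    """Return the mode of the classes predicted by the individual trees."""
    votes = [tree(x) for tree in trees]
    return Counter(votes).most_common(1)[0][0]


# Hypothetical stand-ins for trained trees: each maps a score to a class.
trees = [
    lambda score: "yes" if score > 5.0 else "no",
    lambda score: "yes" if score > 6.0 else "no",
    lambda score: "yes" if score > 7.5 else "no",
]

vote_high = forest_predict(trees, 6.5)  # votes: yes, yes, no
vote_low = forest_predict(trees, 5.5)   # votes: yes, no, no
print(vote_high, vote_low)
```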
Random Forests
Pros
Cons
Support Vector Machines
SVM
[Figure: Got a job vs. Assessment Score in Programming Skills]
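• Plus-plane = { x : w · x + b = +1 }
• Minus-plane = { x : w · x + b = -1 }
Then define:
$$\mathbf{w} = \sum_{k=1}^{R} \alpha_k y_k \mathbf{x}_k, \qquad b = y_K (1 - \varepsilon_K) - \mathbf{x}_K \cdot \mathbf{w}, \quad \text{where } K = \arg\max_k \alpha_k$$
Then classify with:
$$f(\mathbf{x}, \mathbf{w}, b) = \operatorname{sign}(\mathbf{w} \cdot \mathbf{x} + b)$$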
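To make these definitions concrete, here is a small sketch that builds w and b from dual coefficients and classifies new points. It assumes the sign convention of the plus-plane and minus-plane definitions (w · x + b), and the two training points, their alpha values and the zero slacks are hand-picked for a toy problem rather than produced by a solver.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def svm_from_dual(alphas, ys, xs, slacks):
    """Build (w, b) from dual coefficients alpha_k, labels y_k, points x_k."""
    dim = len(xs[0])
    # w = sum_k alpha_k * y_k * x_k
    w = [sum(a * y * x[d] for a, y, x in zip(alphas, ys, xs)) for d in range(dim)]
    # K = argmax_k alpha_k (first index wins on ties)
    K = max(range(len(alphas)), key=lambda k: alphas[k])
    # b = y_K * (1 - eps_K) - x_K . w
    b = ys[K] * (1 - slacks[K]) - dot(xs[K], w)
    return w, b


def classify(x, w, b):
    """Sign of w . x + b, matching the plus/minus plane definitions."""
    return 1 if dot(w, x) + b >= 0 else -1


# Toy problem (hypothetical): one positive and one negative point.
xs = [(2.0, 0.0), (0.0, 0.0)]
ys = [1, -1]
alphas = [0.5, 0.5]   # assumed dual solution for this two-point problem
slacks = [0.0, 0.0]   # separable data, so no slack

w, b = svm_from_dual(alphas, ys, xs, slacks)
print(w, b, classify((3.0, 0.0), w, b), classify((0.5, 0.0), w, b))
```

Both training points land exactly on their margin planes here (w · x + b equals +1 and -1), which is a quick sanity check on the toy values.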
A kernel function is some function that corresponds to an inner product in some expanded
feature space.
Example:
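2-dimensional vectors $x = [x_1, x_2]$; let $K(x_i, x_j) = (1 + x_i^T x_j)^2$.
Need to show that $K(x_i, x_j) = \varphi(x_i)^T \varphi(x_j)$:
$K(x_i, x_j) = (1 + x_i^T x_j)^2$
$= 1 + x_{i1}^2 x_{j1}^2 + 2 x_{i1} x_{j1} x_{i2} x_{j2} + x_{i2}^2 x_{j2}^2 + 2 x_{i1} x_{j1} + 2 x_{i2} x_{j2}$
$= [1,\ x_{i1}^2,\ \sqrt{2}\, x_{i1} x_{i2},\ x_{i2}^2,\ \sqrt{2}\, x_{i1},\ \sqrt{2}\, x_{i2}]^T\ [1,\ x_{j1}^2,\ \sqrt{2}\, x_{j1} x_{j2},\ x_{j2}^2,\ \sqrt{2}\, x_{j1},\ \sqrt{2}\, x_{j2}]$
$= \varphi(x_i)^T \varphi(x_j)$, where $\varphi(x) = [1,\ x_1^2,\ \sqrt{2}\, x_1 x_2,\ x_2^2,\ \sqrt{2}\, x_1,\ \sqrt{2}\, x_2]$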
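The identity above can be checked numerically. This sketch evaluates the kernel directly and via the explicit feature map for one arbitrarily chosen pair of vectors; the function names are illustrative.

```python
import math


def poly_kernel(xi, xj):
    """K(xi, xj) = (1 + xi . xj)^2 for 2-dimensional vectors."""
    return (1 + xi[0] * xj[0] + xi[1] * xj[1]) ** 2


def phi(x):
    """Explicit feature map [1, x1^2, sqrt(2) x1 x2, x2^2, sqrt(2) x1, sqrt(2) x2]."""
    x1, x2 = x
    r2 = math.sqrt(2)
    return [1.0, x1 ** 2, r2 * x1 * x2, x2 ** 2, r2 * x1, r2 * x2]


xi, xj = (1.0, 2.0), (3.0, -1.0)  # arbitrary example vectors
lhs = poly_kernel(xi, xj)
rhs = sum(a * b for a, b in zip(phi(xi), phi(xj)))
print(lhs, rhs)
```

The kernel needs one dot product in 2 dimensions, while the explicit map works in 6 dimensions, which is the practical appeal of the kernel trick.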
In 3D
In 1D
Support Vector Machines
Pros
Cons
Naïve Bayes
Apply Bayes’ theorem with the “naive” assumption of independence between
every pair of features
Naïve Bayes
Prior probability: the probability assigned before the evidence is obtained
• P(a) is the prior probability that proposition a is true
• Example: P(cavity) = 0.1
$$\operatorname*{argmax}_{c_j \in C} \frac{P(x_1, x_2, \dots, x_n \mid c_j)\, P(c_j)}{P(x_1, x_2, \dots, x_n)} = \operatorname*{argmax}_{c_j \in C} P(x_1, x_2, \dots, x_n \mid c_j)\, P(c_j)$$
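Under the independence assumption, the joint likelihood factors into a product of per-feature probabilities, which makes the argmax easy to compute from counts. The following is a minimal sketch for categorical features; the add-one smoothing, the function names and the tiny cavity/toothache dataset are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Count classes and per-class feature values from categorical data."""
    class_counts = Counter(labels)
    feature_counts = defaultdict(Counter)  # (class, feature index) -> value counts
    values = defaultdict(set)              # feature index -> distinct values seen
    for row, c in zip(rows, labels):
        for i, v in enumerate(row):
            feature_counts[(c, i)][v] += 1
            values[i].add(v)
    return class_counts, feature_counts, values


def predict(model, row):
    """Return the class c_j maximizing P(c_j) * prod_i P(x_i | c_j)."""
    class_counts, feature_counts, values = model
    total = sum(class_counts.values())
    best_class, best_score = None, -1.0
    for c, n_c in class_counts.items():
        score = n_c / total  # prior P(c_j)
        for i, v in enumerate(row):
            # Add-one smoothed estimate of P(x_i | c_j) (an assumed choice).
            score *= (feature_counts[(c, i)][v] + 1) / (n_c + len(values[i]))
        if score > best_score:
            best_class, best_score = c, score
    return best_class


# Hypothetical data: (toothache, probe catches) -> diagnosis.
rows = [("yes", "yes"), ("yes", "yes"), ("yes", "no"), ("no", "no"),
        ("no", "no"), ("no", "yes"), ("no", "no"), ("yes", "no")]
labels = ["cavity", "cavity", "cavity", "healthy",
          "healthy", "healthy", "healthy", "healthy"]

model = train_naive_bayes(rows, labels)
first = predict(model, ("yes", "yes"))
second = predict(model, ("no", "no"))
print(first, second)
```

The denominator P(x_1, ..., x_n) is never computed because it is the same for every class and does not change the argmax.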
Naïve Bayes
Pros
Cons
Brainstorm
Which classifier should I use?
Questions
• What is the relative importance of each predictor?
• How does each variable affect the outcome?
• Does a predictor make the solution better or worse or have no effect?
Questions
• Can parameters be accurately predicted?
• How good is the model at classifying cases for which the outcome is known?
Questions
• What is the prediction equation in the presence of covariates?
• Can prediction models be tested for relative fit to the data?
• So-called “goodness of fit” statistics
• What is the strength of association between the outcome variable and a
set of predictors?
• Often in model comparison you want non-significant differences so strength of
association is reported for even non-significant effects.
Unsupervised Learning
Clustering
Draw inferences from datasets consisting of input data without labeled responses.
Clustering is used for exploratory data analysis to find hidden patterns or groupings
in data.
Where is Clustering Used?
Marketing: segment customer behaviors
Banking: fraud detection
Gene Analysis: identify genes responsible for a disease
Image Processing: identifying objects in an image (e.g. face recognition)
Insurance: identify policy holders with high average claim cost
K-Means Algorithm
1. Fix a number k = number of required clusters (here K = 3)
2. Select K random points in the given space as the initial centroids
3. For each point, find its distance from the K centroids, and assign it to the closest centroid
4. Within each newly formed cluster, find the new centroid
5. Repeat steps 3 and 4 until the centroids are stable (converge)
6. Obtain final clusters
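The steps of the K-Means algorithm can be sketched in plain Python as below. This is an illustrative sketch, not the course's hands-on code: the function signature, the choice of drawing random initial centroids from the data points, and the nine sample points are assumptions. The demo passes explicit initial centroids so the run is reproducible.

```python
import math
import random


def k_means(points, k, initial_centroids=None, max_iter=100, seed=0):
    """Return (centroids, clusters) after iterating until the centroids are stable."""
    rng = random.Random(seed)
    # Initial centroids: given explicitly, or k random data points (assumed choice).
    centroids = list(initial_centroids) if initial_centroids else rng.sample(points, k)
    for _ in range(max_iter):
        # Assign every point to its closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its cluster.
        new_centroids = []
        for i, cluster in enumerate(clusters):
            if cluster:
                new_centroids.append(tuple(sum(c) / len(cluster) for c in zip(*cluster)))
            else:
                new_centroids.append(centroids[i])  # keep the centroid of an empty cluster
        # Stop once the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters


# Hypothetical 2-D data forming three well-separated groups (K = 3).
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9), (1, 8), (2, 9), (1, 9)]
centroids, clusters = k_means(points, 3, initial_centroids=[(1, 1), (8, 8), (1, 8)])
print(centroids)
```

Different initial centroids can lead to different final clusters, so in practice the algorithm is often run several times and the best result kept.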
Hands On
Tasks For Today
1. Set up Project
2. Loading Data – what the data looks like
3. Understanding Train-Test split
4. Summarizing the Data
5. Extracting Features
6. Training Model
7. Finding Result for unseen test data
8. Evaluation
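Tasks 3, 6, 7 and 8 above can be sketched end to end in a few lines. Everything here is a hypothetical stand-in for the hands-on material: the synthetic score data, the 25% test fraction, the helper names, and the very simple threshold "model" are assumptions chosen to keep the example self-contained.

```python
import random


def train_test_split(rows, labels, test_fraction=0.25, seed=0):
    """Shuffle indices and hold out a fraction of the data as unseen test data."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = max(1, int(len(rows) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return ([rows[i] for i in train_idx], [rows[i] for i in test_idx],
            [labels[i] for i in train_idx], [labels[i] for i in test_idx])


def accuracy(predicted, actual):
    """Fraction of test cases where the prediction matches the known outcome."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


# Hypothetical data: assessment score -> "got a job" label.
scores = list(range(20))
got_job = ["yes" if s >= 10 else "no" for s in scores]

x_train, x_test, y_train, y_test = train_test_split(scores, got_job)

# "Training": place a threshold midway between the two classes seen in training.
max_no = max(x for x, y in zip(x_train, y_train) if y == "no")
min_yes = min(x for x, y in zip(x_train, y_train) if y == "yes")
threshold = (max_no + min_yes) / 2

# Predict on the held-out test data and evaluate.
predictions = ["yes" if x > threshold else "no" for x in x_test]
acc = accuracy(predictions, y_test)
print(len(x_train), len(x_test), acc)
```

The key point is that the model never sees the test rows during training, so the accuracy estimates how it behaves on unseen data.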
Hands On Session
What’s Next?