Session 1
Key Terms
• Business Analytics
• Business Intelligence
• Database
• Data Warehouse
• Data Mining
• Big Data
• Machine Learning
• Artificial Intelligence
Data Science: New Science Paradigms
• A thousand years ago (pre-1600): science was empirical, describing natural phenomena
• Last few hundred years: a theoretical branch emerged, using models and generalizations
Source: https://www.youtube.com/watch?v=Rd8gVeqE-q
[Figure: business value of data storage concepts, from the Data Swamp (no value) and Data Puddle (limited scope and value) through the Data Mart and Data Warehouse (cost savings) up to the Data Lake (enterprise impact)]
What Makes a Successful Data Analytics?
• How many cars are there?
• How many are red cars?
• Are they of the same model?
• How much does a car weigh?
• What is the horsepower of a car?
November 10, 2023
Analytics and Statistics
• Analytics deals with what we know; statistics deals with what we don’t know
Data Analytics Lifecycle
1. Discovery
2. Data Preparation
3. Model Planning
4. Model Building
5. Communicating Results
6. Operationalize
Discovery Phase
• Learning the business domain
• Resource identification
• Framing the problem
• Identifying key stakeholders
• Interviewing the analytics sponsor
• Developing initial hypotheses
• Identifying potential data sources
Data Preparation phase
• Preparing analytical sandbox
• Performing ETLT
• Learning about the data
• Data Conditioning
• Survey and visualize
Tools for data preparation phase
• Hadoop
• Alpine Miner
• OpenRefine
• Data Wrangler
Model Planning Phase
• Data Exploration and Variable selection
• Model selection
Tools for model planning phase
•R
• SQL Analysis Services
• SAS/Access
Model Building Phase
• POC construction
• Validation of model
• Parameter adjustment
Tools for model building
• SAS Enterprise Miner
• SPSS Modeler
• MATLAB
• Alpine Miner
• Statistica
• R
• Octave
• WEKA
• Python
• SQL
Communicate results
• Training and education
• Documentation
• Version management
Tools used
• Cognos
• SSRS
• Dashboard/scorecard
•R
•…
Operationalize phase
• Deploying the model into live production
• Maintenance and support
Factors causing failures
• Improper planning
• Inadequate project management
• Company not ready for a data warehouse
• Insufficient staff training
• Improper team management
• No support from top management
Implementing the Data Analytics project
• Decide
• the type of data analytics to be built
• where to keep the data analytics project
• where the data is going to come from
• whether you have all the needed data
• who will be using the data analytics project
• how they will use it
• at what times will they use it
Driving Force
• Business Requirements, Not Technology
• Understand the requirements
• Focus on
• user’s needs
• Data needed
• How to provide information
• Use a preliminary survey to gather general requirements before
planning
Challenges for Data Analytics Project Management
Challenges span three areas: data acquisition, data storage, and information delivery. Acquisition covers selecting task-relevant data, data cleaning, and data integration across source databases.
SEMMA Model
CRISP-DM (CRoss-Industry Standard Process for Data Mining)
Comparison between Reporting and Analysis
[Table: Reporting vs. Analysis]
Solution . . . ?
[Diagram: a data warehouse with query & analysis tools produces OLAP reports]
Problem-2
[Diagram: users must wait while data is extracted from the operational database]
Solution . . . ?
[Diagram: extract data from the operational database into a data warehouse]
Problem-3
[Diagram: data should be cleaned before loading]
Solution . . . ?
[Diagram: cleaned data in a data warehouse, with proper query and analysis tools, gives the manager an improvement]
“Data Analysis, Where You Don’t Know the Second Question to Ask Until You See the Answer to the First One.”
[Case study: a company having great success with employers interested in tracking exercise data wants to match users to personal trainers in the same locale; it has data and Excel to start.]
Operational Database vs. Data Warehouse
[Diagram: operational systems (Loans, Credit Card, Trust, Savings) feed a data warehouse organized by subject (Customer, Vendor, Product, Activity)]
To summarize ...
• Operational systems are used to “run” a business
Dimensions
[Example: a cube with 3 dimensions, e.g., Location = “Delhi” and Item (type)]
• Star Schema
• Snowflake Schema
• Fact Constellation Schema
[Diagrams: Star Schema, Snowflake Schema, and Fact Constellation Schema]
Dimension (Concept) Hierarchies
Store dimension: Total → Region → District → Stores
Product dimension: Total → Manufacturer → Brand → Products
ROLL UP Operation
Also called the drill-up operation
Performs aggregation on the data cube, either by climbing up a concept hierarchy for a dimension or by dimension reduction
Example: climbing up the Location hierarchy
Drill Down Operation
Reverse of the roll-up operation
Moves from less detailed data to more detailed data
Steps down a concept hierarchy or introduces new dimensions
Example: stepping down the Time hierarchy
Slice Operation
Selection on one dimension of the given cube, resulting in a sub-cube, e.g., Time = “Q1”.
Dice Operation
Selects a sub-cube from the OLAP cube by selecting two or more dimensions, e.g.:
Location = “Delhi” or “Kolkata”
Time = “Q1” or “Q2”
Item = “Car” or “Bus”
Pivot
It is also known as the rotation operation, as it rotates the current view to obtain a new view of the representation.
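The four OLAP operations above can be imitated on a flat fact table with pandas; the cube below is a made-up example using the dimensions from these slides (Location, Time, Item), so the data and numbers are illustrative only.

```python
import pandas as pd

# Hypothetical fact table: one row per (Location, Time, Item) cell of the cube.
cube = pd.DataFrame({
    "Location": ["Delhi", "Delhi", "Kolkata", "Kolkata", "Delhi", "Kolkata"],
    "Time":     ["Q1",    "Q2",    "Q1",      "Q2",      "Q1",    "Q1"],
    "Item":     ["Car",   "Car",   "Bus",     "Bus",     "Bus",   "Car"],
    "Sales":    [100,     120,     80,        90,        60,      70],
})

# Roll-up: aggregate away the Item dimension (dimension reduction).
rollup = cube.groupby(["Location", "Time"])["Sales"].sum()

# Slice: fix one dimension (Time = "Q1") to obtain a sub-cube.
slice_q1 = cube[cube["Time"] == "Q1"]

# Dice: select on two or more dimensions at once.
dice = cube[cube["Location"].isin(["Delhi", "Kolkata"])
            & cube["Time"].isin(["Q1", "Q2"])
            & cube["Item"].isin(["Car", "Bus"])]

# Pivot: rotate the view, Locations as rows and Time as columns.
pivot = cube.pivot_table(index="Location", columns="Time",
                         values="Sales", aggfunc="sum")
print(pivot)
```

Each operation returns an ordinary DataFrame or Series, so the operations can be chained, e.g., roll up first and then slice the result.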
Characteristics of DW
• Subject oriented
• Integrated
• Time-variant (time series)
• Nonvolatile
• Summarized
• Not normalized
• Metadata
• Web based, relational/multi-dimensional
• Client/server
• Real-time and/or right-time (active)
How are organizations using the information from data
warehouses?
• Business decision making
• Increasing customer focus, which includes the analysis of customer buying patterns
• Repositioning products and managing product portfolios by comparing the performance of sales by quarter, by year, by geographic area, and more, in order to fine-tune production strategies
• Analyzing operations and looking for sources of profits
• Managing customer relationships, making environmental corrections and managing the
cost of corporate assets
• Traditional heterogeneous DB integration
The Importance of Data Warehousing
• Human expertise does not exist
• Humans are unable to explain their expertise
• The solution changes over time
• The solution needs to be adapted to particular cases
• Models must be customized
• Models are based on huge amounts of data
Digit Recognition Example
What is Machine Learning?
Optimize a performance
criterion using example data
or past experience.
Types of Machine Learning
• Supervised Learning: Classification, Regression
• Unsupervised Learning: Association, Clustering
• Reinforcement Learning: e.g., learning robot movement
Association Rule Mining
Session 4
Association rule mining was introduced by Agrawal et al. in 1993.
Applications – (2)
• Baskets = sentences;
• Items = documents containing those sentences
• Items that appear together too often could represent plagiarism
• Notice items do not have to be “in” baskets
Applications – (3)
• Baskets = patients;
• Items = drugs & side-effects
• Has been used to detect combinations
of drugs that result in particular side-effects
• But requires extension: Absence of an item
needs to be observed as well as presence
Applications – (4)
• Looking for K(s,t): s buckets B_i that all contain the same sets of size t
• This forms a dense 2-layer (bipartite) graph
Result of Python
Rule: light cream -> chicken
Support: 0.004532728969470737  Confidence: 0.29059829059829057  Lift: 4.84395061728395
=====================================
Rule: mushroom cream sauce -> escalope
Support: 0.005732568990801126  Confidence: 0.3006993006993007  Lift: 3.790832696715049
=====================================
Rule: escalope -> pasta
Support: 0.005865884548726837  Confidence: 0.3728813559322034  Lift: 4.700811850163794
=====================================
Rule: ground beef -> herb & pepper
Support: 0.015997866951073192  Confidence: 0.3234501347708895  Lift: 3.2919938411349285
Result of Python
light cream -> chicken
Support: 0.004532728969470737
Confidence: 0.29059829059829057
Lift: 4.84395061728395
Terms
• “IF” part = antecedent
• “THEN” part = consequent
• conf(I → j) = support(I ∪ {j}) / support(I)
Lift
Lift defines the strength of a rule. It is the ratio of the observed support to the support expected if X and Y were independent of each other:

Lift(A → B) = Support(A and B) / (Support(A) × Support(B))
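The three measures can be sketched in a few lines of Python. The tiny transaction list below is invented for illustration (its item names merely echo the apriori output shown earlier), so the numbers here are not the ones from that output.

```python
# Support, confidence and lift for a rule A -> B over a toy set of
# market-basket transactions (invented data, for illustration only).
transactions = [
    {"light cream", "chicken"},
    {"light cream", "chicken", "pasta"},
    {"chicken", "pasta"},
    {"light cream"},
    {"pasta", "escalope"},
]

def support(itemset, transactions):
    # Fraction of transactions that contain every item in the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(A, B, transactions):
    # conf(A -> B) = support(A and B) / support(A)
    return support(A | B, transactions) / support(A, transactions)

def lift(A, B, transactions):
    # lift(A -> B) = support(A and B) / (support(A) * support(B))
    return support(A | B, transactions) / (
        support(A, transactions) * support(B, transactions))

A, B = {"light cream"}, {"chicken"}
print(support(A | B, transactions))   # 0.4
print(confidence(A, B, transactions))
print(lift(A, B, transactions))
```

A lift above 1 means A and B co-occur more often than independence would predict, which is exactly the "strength" the definition above describes.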
Interpretation of Lift
Interpretation
• Support measures overall impact
• Confidence shows the rate at which consequents will be found (useful in learning the costs of promotion)
• Lift ratio shows how effective the rule is in finding consequents (useful when finding particular consequents is important)
Frequent Item Sets
• Ideally, we want to create all possible combinations of items
Interesting Association Rules
• Not all high-confidence rules are interesting
• The rule X → milk may have high confidence for many
itemsets X, because milk is just purchased very often
(independent of X) and the confidence will be high
Association Rule
Total number of transactions = 2000

            Tea   Milk   Total
Diaper       5     4       9
No Diaper    6     3       9
Total       11     7      18
Negative Rules
Support(Beer, Diaper) = 5/18
B1 = {m, c, b}      B2 = {m, p, j}
B3 = {m, b}         B4 = {c, j}
B5 = {m, p, b}      B6 = {m, c, b, j}
B7 = {c, b, j}      B8 = {b, c}
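Support counting over these eight baskets can be sketched directly in Python. The absolute support threshold of 3 is an assumption chosen for this sketch; the second pass uses the A-Priori idea that only pairs of frequent items can themselves be frequent.

```python
from itertools import combinations
from collections import Counter

# The eight baskets from the example above.
baskets = [
    {"m", "c", "b"}, {"m", "p", "j"},
    {"m", "b"},      {"c", "j"},
    {"m", "p", "b"}, {"m", "c", "b", "j"},
    {"c", "b", "j"}, {"b", "c"},
]

min_support = 3  # absolute support threshold (assumed for this sketch)

# First pass: count single items, keep the frequent ones.
item_counts = Counter(item for b in baskets for item in b)
frequent_items = {i for i, n in item_counts.items() if n >= min_support}

# Second pass: count only pairs whose members are both frequent.
pair_counts = Counter()
for b in baskets:
    for pair in combinations(sorted(b & frequent_items), 2):
        pair_counts[pair] += 1
frequent_pairs = {p for p, n in pair_counts.items() if n >= min_support}

print(sorted(frequent_items))   # ['b', 'c', 'j', 'm']
print(sorted(frequent_pairs))   # [('b', 'c'), ('b', 'm'), ('c', 'j')]
```

Item p appears in only two baskets, so it is pruned in the first pass and never participates in pair counting.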
Measures of Rule Performance
• demographic characteristics
• psychographics
• desired benefits from products/services
• past-purchase and product-use behaviors
Cluster Analysis
• Cluster analysis is a class of statistical techniques that can be applied
to data that exhibit natural groupings.
• Cluster analysis makes no distinction between dependent and
independent variables. The entire set of interdependent
relationships is examined.
• Cluster analysis sorts through the raw data on customers and groups
them into clusters. A cluster is a group of relatively homogeneous
customers.
Basic step of Cluster Analysis
Clustering for understanding, e.g., grouping genes for browsing or grouping stocks with similar price fluctuations:
Cluster 1 (Technology1-DOWN): Cabletron-Sys, CISCO, HP, DSC-Comm, INTEL, LSI-Logic, Micron-Tech, Texas-Inst, Tellabs-Inc, Natl-Semiconduct, Oracl, SGI
Cluster 4 (Oil-UP): Louisiana-Land, Phillips-Petro, Unocal, Schlumberger
• Summarization: reduce the size of large data sets
[Figure: clustering precipitation in Australia]
Clustering for Data Understanding and Applications
Types of Clusterings
• Partitional Clustering: a division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset
• Hierarchical clustering: a set of nested clusters organized as a hierarchical tree
Partitional Clustering
[Figure: sample points p1, p2, p3, p4 partitioned into non-overlapping clusters]
[Figures: a non-traditional hierarchical clustering and its dendrogram]
Clustering Algorithms
• K-means and its variants
• Hierarchical clustering
Hard vs. soft clustering
• Hard clustering: Each item belongs to exactly one cluster
• More common and easier to do
• Soft clustering: An item can belong to more than one cluster.
• Makes more sense for applications like creating browsable
hierarchies
• You may want to put a pair of sneakers in two clusters: (i)
sports apparel and (ii) shoes
• You can only do that with a soft clustering approach.
Hierarchical Clustering
• Agglomerative (bottom-up)
• Divisive (top-down)
A Simple Example Showing the Implementation of the k-means Algorithm (using K = 2)
Step 1:
Initialization: we randomly choose the following two centroids (k = 2) for the two clusters.
In this case the two centroids are m1 = (1.0, 1.0) and m2 = (5.0, 7.0).
Step 2:
• Assigning each object to its nearest centroid, we obtain two clusters containing {1, 2, 3} and {4, 5, 6, 7}.
• Their new centroids are recomputed as the means of the clusters.
Step 3:
• Now, using these new centroids, we compute the Euclidean distance of each object again, as shown in the table.
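The worked example can be reproduced in plain Python. The slide text gives only the two initial centroids, so the seven data points below are an assumption (the set commonly paired with this example); with them, the first assignment yields exactly the clusters {1, 2, 3} and {4, 5, 6, 7} described above.

```python
import math

# Seven 2-D points (assumed, not given in the slide text) and the
# two initial centroids m1 and m2 from Step 1.
points = [(1.0, 1.0), (1.5, 2.0), (3.0, 4.0), (5.0, 7.0),
          (3.5, 5.0), (4.5, 5.0), (3.5, 4.5)]
centroids = [(1.0, 1.0), (5.0, 7.0)]

def dist(p, q):
    # Euclidean distance between two 2-D points.
    return math.hypot(p[0] - q[0], p[1] - q[1])

for step in range(10):
    # Assignment step: each point goes to its nearest centroid.
    clusters = [[], []]
    for p in points:
        k = min(range(2), key=lambda i: dist(p, centroids[i]))
        clusters[k].append(p)
    # Update step: each centroid becomes the mean of its cluster.
    new_centroids = [
        (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
        for c in clusters
    ]
    if new_centroids == centroids:  # converged: assignments stopped changing
        break
    centroids = new_centroids

print(clusters)
print(centroids)
```

With these points, the algorithm converges after a few iterations to a small cluster around (1.25, 1.5) and a larger one around (3.9, 5.1).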
Supervised Learning
Session 6
Example: Spam Filter
Classification Examples
• OCR (input: images, classes: characters)
• Medical diagnosis (input: symptoms, classes: diseases)
• Automatic essay grader (input: document, classes: grades)
• Fraud detection (input: account activity, classes: fraud / no
fraud)
• Customer service email routing
• Recommended articles in a newspaper, recommended
books
• Financial investments
Illustrating Classification Task

Training set (used to learn the model):
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test set (the model is applied):
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?
Classification Process (1): Model Construction
[Diagram: training data + classification algorithm → classifier; testing data and unseen data, e.g., (Jeff, Professor, 4), are then classified]
Decision Tree
Working of Decision tree
• Select the best attribute to split the records, using an Attribute Selection Measure (ASM).
• Make that attribute a decision node and break the dataset into smaller subsets.
• Build the tree by repeating this process recursively for each child until one of these conditions is met:
• All the tuples belong to the same attribute value (class).
• There are no more remaining attributes.
• There are no more instances.
Attribute Selection
• Information Gain (Entropy)
• Gain Ratio
• Gini Index
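The first and third measures can be sketched in a few lines; the class labels and the candidate split below are a made-up example, not data from these slides.

```python
import math
from collections import Counter

def entropy(labels):
    # Entropy of a list of class labels: -sum(p * log2(p)).
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    # Gini index: 1 - sum(p^2).
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

parent = ["Yes"] * 5 + ["No"] * 5                          # before the split
left = ["Yes"] * 4 + ["No"]                                # left child
right = ["Yes"] + ["No"] * 4                               # right child

# Information gain = entropy(parent) - weighted entropy of the children.
n = len(parent)
gain = (entropy(parent)
        - (len(left) / n) * entropy(left)
        - (len(right) / n) * entropy(right))

print(round(entropy(parent), 3))  # 1.0 for a 50/50 class split
print(round(gain, 3))
```

The attribute whose split gives the largest gain (or the smallest weighted Gini) is chosen as the decision node.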
Logistic Regression
Confusion Matrix
• A confusion matrix is a table that is often used to describe the performance of a classification model on a set of test data for which the true values are known.
• True Positive: We predicted positive and it’s true. In the image, we
predicted that a woman is pregnant and she actually is.
• True Negative: We predicted negative and it’s true. In the image,
we predicted that a man is not pregnant and he actually is not.
• False Positive (Type 1 Error): We predicted positive and it’s false. In
the image, we predicted that a man is pregnant but he actually is
not.
• False Negative (Type 2 Error): We predicted negative and it’s false.
In the image, we predicted that a woman is not pregnant but she
actually is.
Classification Metric
• Accuracy simply measures how often the classifier predicts correctly. We can define accuracy as the ratio of the number of correct predictions to the total number of predictions.
• Precision measures how many of the cases predicted as positive actually turned out to be positive. Precision is useful in cases where a False Positive is of higher concern than a False Negative. Precision matters in music
or video recommendation systems, e-commerce
websites, etc. where wrong results could lead to
customer churn and this could be harmful to the
business.
• Recall (sensitivity) explains how many of the actual
positive cases we were able to predict correctly with
our model. Recall is a useful metric in cases where False
Negative is of higher concern than False Positive. It is
important in medical cases where it doesn’t matter
whether we raise a false alarm but the actual positive
cases should not go undetected!
• F1 Score combines Precision and Recall into a single number (their harmonic mean). It is maximized when Precision equals Recall.
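All four metrics follow directly from the four confusion-matrix counts; the counts below are a hypothetical test-set result, chosen only to illustrate the formulas.

```python
# Hypothetical confusion-matrix counts for a binary classifier.
TP, FP, FN, TN = 40, 10, 5, 45

accuracy = (TP + TN) / (TP + FP + FN + TN)   # correct / all predictions
precision = TP / (TP + FP)                   # of predicted positives, how many are real
recall = TP / (TP + FN)                      # of real positives, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(accuracy, precision, recall, f1)
# accuracy = 0.85, precision = 0.8
```

Note how the same classifier scores differently on each metric: here recall is higher than precision, so this model would suit a screening task more than a recommendation task.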
When F1 score is useful