
Business Analytics

Session 1
Key Terms
• Business Analytics
• Business Intelligence
• Database
• Data Warehouse
• Data Mining
• Big Data
• Machine Learning
• Artificial Intelligence
Data Science: New Science Paradigms
• A thousand years ago (pre-1600): science was empirical, describing natural phenomena

• Last few hundred years (1600–1900): a theoretical branch, using models and generalizations, e.g. $\left(\frac{\dot a}{a}\right)^2 = \frac{4\pi G\rho}{3} - \frac{\kappa c^2}{a^2}$

• Last few decades (1900–2000): a computational branch, simulating complex phenomena

• Today (post-2000): data exploration (data science), unifying theory, experiment, and simulation using data management and statistics
• Data captured by instruments or generated by simulators
• Processed by software
• Scientists analyze databases / files
Industrial Revolution

Source: https://www.youtube.com/watch?v=Rd8gVeqE-q
[Figure: business value of data stores — from Data Swamp (no value) and Data Puddle (limited scope and value) through Data Mart and Data Warehouse (cost savings) to Data Lake (enterprise impact)]
What Makes a Successful Data Analytics?

Right Platform + Right Data + Right Interface

Analytics without subject expertise may be dangerous
Analytics and Statistics

Analytics: How many cars are there? How many are red cars? Are they of the same model?
Statistics: How much does a car weigh? What is the horsepower of a car?
Analytics and Statistics
• Analytics deals with what we know; statistics deals with what we don’t know
• Analytics is looking at and making inferences from facts; statistics is learning beyond the facts
• Analytics deals with certainty; statistics deals with uncertainty
• Analytics requires coding skills
Evolution of Analytics
• Analytics 1.0—the era of “business intelligence.”
- Data Integration within the organization
- IBM DB2, Oracle V3, Sybase (SAP)
- Descriptive and Diagnostic Analytics
• Analytics 2.0—the era of big data
- Internet goes global
- Amazon (1995), Hotmail (1996), PayPal (1998), Google (1998)
- Predictive analytics and Advanced Analytics
• Analytics 3.0—the era of data-enriched offerings
- world goes social
- Data Products
- LinkedIn (2003), Skype (2003), Facebook (2004), Twitter (2006)
- Prescriptive Analytics
• Analytics 4.0—the era of “augmented reality”
- “Fast-pervasive data” is replacing “big data”
Analytics Maturity Model
• Data: breadth, integration, quality
• Enterprise: approach to managing analytics
• Leadership: passion and commitment
• Targets: first deep, then broad
• Analysts: professionals and amateurs

Progression (more analytical = higher performance):
Analytically Impaired → Localized Analytics → Analytical Aspirations → Analytical Companies → Analytical Competitors
Moving to: success factors across the five stages

Data
- Analytically Impaired: inconsistent, poor quality, poorly organized
- Localized Analytics: data useable, but in functional or process silos
- Analytical Aspirations: organization beginning to create a centralized data repository
- Analytical Companies: integrated, accurate, common data in a central warehouse
- Analytical Competitors: relentless search for new data and metrics

Enterprise
- Analytically Impaired / Localized Analytics: islands of data, technology, and expertise
- Analytical Aspirations: early stages of an enterprise-wide approach
- Analytical Companies: key data, technology, and analysts are centralized or networked
- Analytical Competitors: all key analytical resources centrally managed

Leadership
- Analytically Impaired: no awareness or interest
- Localized Analytics: only at the function or process level
- Analytical Aspirations: leaders beginning to recognize the importance of analytics
- Analytical Companies: leadership support for analytical competence
- Analytical Competitors: strong leadership passion for analytical competition

Targets
- Analytically Impaired / Localized Analytics: multiple disconnected targets that may not be strategically important
- Analytical Aspirations: analytical efforts coalescing behind a small set of targets
- Analytical Companies: analytical activity centered on a few key domains
- Analytical Competitors: analytics support the firm’s distinctive capability and strategy

Analysts
- Analytically Impaired: few skills, and these attached to specific functions
- Localized Analytics: isolated pockets of analysts with no communication
- Analytical Aspirations: influx of analysts in key target areas
- Analytical Companies: highly capable analysts in a central or networked organization
- Analytical Competitors: world-class professional analysts and attention to analytical amateurs
Key roles in successful analytics project
• Business user – Understands the domain area
• Project Sponsor -- Provides requirements
• Project Manager -- Ensures Meeting Objectives
• Business Intelligence analysts -- Provides business domain
expertise based on deep understanding of the data
• Database Administrator -- Creates DB Environment
• Data engineer -- provides technical skills, assists data
management and extraction, supports analytical activities
• Data Scientist -- Provides analytic techniques and modeling
Analytical Life Cycle

Discovery → Data Preparation → Model Planning → Model Building → Communicating Results → Operationalize
Data Analytics Lifecycle
1. Discovery
2. Data Preparation
3. Model Planning
4. Model Building
5. Communicating Results
6. Operationalize
Discovery Phase
• Learning the business domain
• Resource identification
• Framing the problem
• Identifying key stakeholders
• Interviewing the analytics sponsor
• Developing initial hypotheses
• Identifying potential data sources
Data Preparation phase
• Preparing analytical sandbox
• Performing ETLT (extract, transform, load, transform)
• Learning about the data
• Data Conditioning
• Survey and visualize
Tools for data preparation phase
• Hadoop
• Alpine Miner
• OpenRefine
• Data Wrangler
Model Planning Phase
• Data Exploration and Variable selection
• Model selection
Tools for model planning phase
•R
• SQL Analysis services
• SAS/Access
Model Building Phase
• POC construction
• Validation of model
• Parameter adjustment
Tools for model building
• SAS Enterprise Miner
• SPSS Modeler
• Matlab
• Alpine miner
• Statistica
• R
• Octave
• WEKA
• Python
• SQL
Communicate results
• Training and education
• Documentation
• Version management
Tools used
• Cognos
• SSRS
• Dashboard/scorecard
•R
•…
Operationalize phase
• Making live
• Maintenance and support
Factors causing failures

• Improper planning
• Inadequate project management
• Company not ready for a data warehouse
• Insufficient staff training
• Improper team management
• No support from top management
Implementing the Data Analytics project
• Decide
• the type of data analytics to be built
• where to keep the data analytics project
• where the data is going to come from
• whether you have all the needed data
• who will be using the data analytics project
• how they will use it
• at what times will they use it
Driving Force
• Business Requirements, Not Technology
• Understand the requirements
• Focus on
• user’s needs
• Data needed
• How to provide information
• Use a preliminary survey to gather general requirements before
planning
Challenges for Data Analytics Project Management
Data Acquisition
• Large number of sources
• Many disparate sources
• Different computing platforms
• Outside sources
• Huge initial load
• Ongoing data feeds
• Data replication considerations
• Difficult data integration
• Complex data transformations
• Data cleansing

Data Storage
• Storage of large data volumes
• Rapid growth
• Need for parallel processing
• Data storage in staging area
• Multiple index types
• Several index files
• Storage of newer data types
• Archival of old data
• Compatibility with tools
• RDBMS & MDDBMS

Information Delivery
• Several user types
• Queries stretched to limits
• Multiple query types
• Web-enabled
• Multidimensional analysis
• OLAP functionality
• Metadata management
• Interfaces to DSS apps
• Feed into data mining
• Multi-vendor tools
Popular analytics processes
• KDD
• Scientific methods
• CRISP-DM
• DELTA
• Applied information economics
• MAD skills
KDD Process

Databases → Data Cleaning → Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation

Data mining is the core of the knowledge discovery process.
SEMMA Model (Sample, Explore, Modify, Model, Assess)

CRISP-DM (CRoss-Industry Standard Process for Data Mining)

• A data mining methodology and process model
• For anyone
• Provides a complete blueprint
• Life cycle: 6 phases
Comparison between Reporting and Analysis

Reporting | Analysis
Provides data | Provides answers
Provides what is asked for | Provides what is needed
Is typically standardized | Is typically customized
Does not involve a person | Involves a person
Is fairly inflexible | Is extremely flexible
Business Analytics
Session 2
Are databases a good tool for data analysis?
Problem-1: Average profit per branch?
Solution: a data warehouse with query & analysis tools, producing OLAP reports.
Problem-2
Solution: instead of making users wait on the operational database, extract data from it into a data warehouse.
Problem-3: Improvement
Solution: data should be cleaned before entering the data warehouse, and managers need proper query and analysis tools.
“Data Analysis, Where You Don’t Know the Second
Question to Ask Until You See the Answer to the First One.”
• Having great success with employers interested in tracking exercise data.
• Got data and Excel to start.
• Wants to match users to personal trainers in the same locale and earn a referral fee. How to track them? Mailing address? IP address?
• What impact will new products/services have on revenue and margins?
Necessity is the mother of
invention
Operational Systems
• Run the business in real time
• Based on up-to-the-second data
• Optimized to handle large numbers of simple read/write
transactions
• Optimized for fast response to predefined transactions
• Used by people who deal with customers, products --
clerks, salespeople etc.
• They are increasingly used by customers
Data Warehouse
• A data warehouse is a
• subject-oriented
• integrated
• time-variant
• non-volatile
collection of data that is used primarily in organizational
decision making.
-- Bill Inmon, Building the Data Warehouse 1996
So, what’s different?
Application-Orientation vs. Subject-Orientation

Application-Orientation (operational database): Loans, Credit Card, Trust, Savings
Subject-Orientation (data warehouse): Customer, Vendor, Product, Activity
To summarize ...

• Operational systems are used to “run” a business
• The data warehouse helps to “optimize” the business
[Figure: data warehouse architecture — data sources (transaction data from IBM IMS, VSAM, Oracle, Sybase, Informix; ERP/SAP, Siebel, web clickstream, external demographic data from Harte-Hanks) flow through ETL software (Ascential, Informatica, Sagent, Microsoft; clean/scrub/transform with Firstlogic) into data stores (staging area, data warehouse on Teradata/Oracle/Essbase, data marts for marketing/sales/finance, metadata), and are accessed with data analysis tools (SQL, Cognos, SAS, MicroStrategy, Business Objects, web browser) for queries, reporting, DSS/EIS, and data mining by analysts, managers, executives, operational personnel, and customers/suppliers]
Two Data Warehousing Strategies

• Enterprise-wide warehouse, top down, the Inmon methodology
• Data mart, bottom up, the Kimball methodology
• When properly executed, both result in an enterprise-wide data warehouse
Data warehouse design process
• Identify the business process
• Declare the granularity
• Identify the dimensions
• Identify the facts
Sample receipt
Data warehouse design process
• Identify the business process
• Declare the granularity
• Identify the dimensions
• Identify the measures
On-line Analytical Processing (OLAP)

• A set of functionality that facilitates multidimensional analysis
• Allows users to analyze data in ways that are natural to them
• Comes in many varieties: ROLAP, MOLAP, DOLAP, etc.
A Sample Data Cube
2-D view of sales data for XYZ Electronics (dimensions: Time, Item; Location = “Delhi”)

Time (quarter) | Home Theatre | Computer | Phone | Security
Q1 | 605 | 825 | 14 | 400
Q2 | 680 | 952 | 31 | 512
Q3 | 812 | 1023 | 30 | 501
Q4 | 927 | 1038 | 38 | 580

Measures = Sold (in INR)
3-D view of sales data for XYZ Electronics (3 dimensions)

[Figure: the 2-D tables stacked by location for “Delhi”, “Kolkata”, “Istanbul”, “Karachi”]
Conceptual Modeling of
Data Warehouses

• Star Schema
• Snowflake Schema
• Fact Constellation Schema
Star Schema
Snowflake
Fact Constellation
Dimension (Concept) Hierarchies

Store dimension: Stores → District → Region → Total
Product dimension: Products → Brand → Manufacturer → Total
ROLL UP Operation
Also called the drill-up operation.
Performs aggregation on a data cube, either by climbing up a concept hierarchy for a dimension (e.g., Location) or by dimension reduction.
Drill Down Operation
Reverse of the roll-up operation: moves from less detailed data to more detailed data, by stepping down a concept hierarchy (e.g., Time) or by introducing new dimensions.
Slice Operation
Selection on one dimension of the given cube (e.g., Time = “Q1”), resulting in a sub-cube.
Dice Operation
Selects a sub-cube from the OLAP cube by selecting on two or more dimensions, e.g., (Location = “Delhi” or “Kolkata”) and (Time = “Q1” or “Q2”) and (Item = “Car” or “Bus”).
Pivot
It is also known as the rotation operation, as it rotates the current view to get a new view of the representation.
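These cube operations map naturally onto dataframe operations. Below is a minimal pandas sketch of roll-up, slice, dice, and pivot on a small hypothetical sales table (toy data, not the XYZ Electronics cube):

```python
# Minimal pandas sketch of roll-up, slice, dice, and pivot.
# The sales table is a hypothetical toy example.
import pandas as pd

sales = pd.DataFrame({
    "location": ["Delhi", "Delhi", "Kolkata", "Kolkata"],
    "quarter":  ["Q1", "Q2", "Q1", "Q2"],
    "item":     ["Phone", "Phone", "Computer", "Phone"],
    "sold":     [14, 31, 1023, 25],
})

# Roll up: aggregate away the item dimension (climb the hierarchy).
rollup = sales.groupby(["location", "quarter"])["sold"].sum()

# Slice: fix a single dimension (Time = "Q1") to get a sub-cube.
slice_q1 = sales[sales["quarter"] == "Q1"]

# Dice: select on two or more dimensions at once.
dice = sales[sales["location"].isin(["Delhi", "Kolkata"])
             & sales["quarter"].isin(["Q1", "Q2"])]

# Pivot: rotate the view -- quarters as rows, items as columns.
pivoted = sales.pivot_table(index="quarter", columns="item",
                            values="sold", aggfunc="sum")
print(pivoted)
```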
Characteristics of DW
• Subject oriented
• Integrated
• Time-variant (time series)
• Nonvolatile
• Summarized
• Not normalized
• Metadata
• Web based, relational/multi-dimensional
• Client/server
• Real-time and/or right-time (active)
How are organizations using the information from data
warehouses?
• Business decision making
• Increasing customer focus, which includes the analysis of customer buying patterns
• Repositioning products and managing product portfolios by comparing performance of sales by quarter, by year, by geographic area, and more, in order to fine-tune production strategies
• Analyzing operations and looking for sources of profits
• Managing customer relationships, making environmental corrections and managing the
cost of corporate assets
• Traditional heterogeneous DB integration
The Importance of Data Warehousing

• Provide a “single version of the truth”
• Improve decision making
• Support key corporate initiatives such as performance management, B2C and B2B e-commerce, and customer relationship management
Warehouse Users
• Analysts
• Managers
• Executives
• Operational personnel
• Customers and suppliers
Operational Data Store

• An operational data store consolidates data from multiple source systems and provides a near real-time, integrated view of volatile, current data.
• Its purpose is to provide integrated data for operational purposes. It has add, change, and delete functionality.
• It may be created to avoid a full-blown ERP implementation.
Data Warehouse vs. Operational DBMS

• OLTP (on-line transaction processing)
  • Major task of traditional relational DBMS
  • Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.
• OLAP (on-line analytical processing)
  • Major task of data warehouse systems
  • Data analysis and decision making
OLTP vs. OLAP
Why Separate Data Warehouse?
• High performance for both systems
• DBMS— tuned for OLTP: access methods, indexing,
concurrency control, recovery
• Warehouse—tuned for OLAP: complex OLAP queries,
multidimensional view, consolidation, summarization.
• A DBMS supports concurrent processing of multiple transactions, so concurrency control and locking mechanisms are required; OLAP operations do not need such mechanisms because access is mostly read-based.
• An operational DBMS does not store the historical data needed for decision making, whereas a data warehouse is built to support decisions.
Data Mart
A departmental data warehouse that stores only relevant data

• Dependent data mart: a subset that is created directly from a data warehouse
• Independent data mart: a small data warehouse designed for a strategic business unit or a department
Machine Learning
Session 3
Why “Learn”?
• Machine learning is programming computers to optimize a performance criterion using example data or past experience.

Reasons to use machine learning:
• Human expertise does not exist
• Humans are unable to explain their expertise
• Solution changes in time
• Solution needs to be adapted to particular cases
• Models must be customized
• Models are based on huge amounts of data
Digit Recognition
Example
What is Machine Learning?

Optimize a performance
criterion using example data
or past experience.
Types of Machine Learning

• Supervised Learning: classification, regression
• Unsupervised Learning: association, clustering
• Reinforcement Learning: e.g., robot movement
Supervised Learning: Uses

• Prediction of future cases: use the rule to predict the output for future inputs
• Knowledge extraction: the rule is easy to understand
• Compression: the rule is simpler than the data it explains
• Outlier detection: exceptions that are not covered by the rule, e.g., fraud
Supervised Learning
Unsupervised Learning
• Learning “what normally happens”
• No output
• Clustering: Grouping similar instances
• Example applications
• Customer segmentation in CRM
• Image compression: Color quantization
• Bioinformatics: Learning motifs
Unsupervised Learning
Reinforcement Learning
•Learning a policy: A sequence of outputs
•No supervised output but delayed reward
•Credit assignment problem
•Game playing
•Robot in a maze

Reinforcement Learning
Association rule Mining
Session 4
Association rule mining was introduced by Agrawal et al. in 1993.

Study of “what goes with what”


• “Customers who bought X also bought Y”
• What symptoms go with what diagnosis

Initially used for Market Basket Analysis to find how items


purchased by customers are related.

Also called affinity analysis



An Example

Bread → Milk [sup = 5%, conf = 100%]


Used in many recommender systems
Applications – (1)
• Items = products;
• Baskets = sets of products someone bought in one trip to the
store
• Real market baskets: Chain stores keep TBs of data about what
customers buy together
• Tells how typical customers navigate stores, lets them position
tempting items
• Suggests tie-in “tricks”, e.g., run sale on diapers
and raise the price of beer
• Need the rule to occur frequently, or no $$’s
• Amazon’s people who bought X also bought Y

Applications – (2)
• Baskets = sentences;
• Items = documents containing those sentences
• Items that appear together too often could represent plagiarism
• Notice items do not have to be “in” baskets

Applications – (3)

• Baskets = patients;
• Items = drugs & side-effects
• Has been used to detect combinations
of drugs that result in particular side-effects
• But requires extension: Absence of an item
needs to be observed as well as presence

Applications – (4)

• Finding communities in graphs (e.g., Twitter)
• Baskets = nodes; Items = outgoing neighbors
• Searching for complete bipartite subgraphs K(s,t) of a big graph
• How? View each node i as a basket B_i of the nodes i points to
• K(s,t) = a set Y of t nodes that occurs together in s baskets B_i
• Looking for K(s,t): set support to s and look at layer t, i.e., all frequent sets of size t

[Figure: a dense 2-layer graph with s nodes on one side and t nodes on the other]
Result of Python

Rule: light cream -> chicken
Support: 0.004532728969470737
Confidence: 0.29059829059829057
Lift: 4.84395061728395
=====================================
Rule: mushroom cream sauce -> escalope
Support: 0.005732568990801126
Confidence: 0.3006993006993007
Lift: 3.790832696715049
=====================================
Rule: escalope -> pasta
Support: 0.005865884548726837
Confidence: 0.3728813559322034
Lift: 4.700811850163794
=====================================
Rule: ground beef -> herb & pepper
Support: 0.015997866951073192
Confidence: 0.3234501347708895
Lift: 3.2919938411349285
=====================================
Result of Python
light cream -> chicken
Support: 0.004532728969470737
Confidence: 0.29059829059829057
Lift: 4.84395061728395
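Output in this format can be produced with the third-party apyori package. The sketch below is an assumption about how such numbers are generated, not the exact script behind these slides; the CSV file name and the thresholds are illustrative.

```python
# Hedged sketch: an apyori-based market-basket script that prints rules
# in the format shown above. File name and thresholds are assumptions.
import csv
from apyori import apriori

with open("market_basket.csv") as f:        # hypothetical transactions file
    transactions = [row for row in csv.reader(f)]

rules = apriori(transactions, min_support=0.003,
                min_confidence=0.2, min_lift=3)

for rule in rules:
    stat = rule.ordered_statistics[0]
    print("Rule:", ", ".join(stat.items_base), "->", ", ".join(stat.items_add))
    print("Support:", rule.support)
    print("Confidence:", stat.confidence)
    print("Lift:", stat.lift)
    print("=" * 37)
```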
Terms

• “IF” part = antecedent
• “THEN” part = consequent
• “Item set” = the items (e.g., products) comprising the antecedent or consequent
• Antecedent and consequent are disjoint (i.e., have no items in common)
Measures of Rule Performance

Support is how frequently an itemset appears in the dataset. It is defined as the fraction of the transactions T that contain the itemset X:

Support(X) = Frequency(X) / |T|
Measures of Rule Performance

Confidence: the % of antecedent transactions that also contain the consequent item set. The confidence of an association rule I → j is the probability of j given I = {i1,…,ik}:

conf(I → j) = support(I ∪ {j}) / support(I)
Lift

Lift measures the strength of a rule: it is the ratio of the observed support to the support expected if A and B were independent of each other (with supports expressed as fractions of the |D| total transactions):

Lift(A → B) = Support(A ∪ B) / (Support(A) × Support(B))
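A minimal Python sketch of the three measures defined above, computed over a toy transaction list (the data is assumed for illustration):

```python
# Support, confidence, and lift computed directly from transactions.
def support(itemset, transactions):
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(A, B, transactions):
    return support(set(A) | set(B), transactions) / support(A, transactions)

def lift(A, B, transactions):
    return confidence(A, B, transactions) / support(B, transactions)

# Toy data: four baskets.
baskets = [{"tea", "milk"}, {"tea"}, {"milk"}, {"tea", "milk", "bread"}]
print(support({"tea", "milk"}, baskets))        # 0.5
print(confidence({"tea"}, {"milk"}, baskets))   # 0.667
print(lift({"tea"}, {"milk"}, baskets))         # 0.889 (< 1: mild substitution)
```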
Interpretation of Lift

• Lift = 1: the antecedent and consequent occur independently of each other.
• Lift > 1: the two itemsets are positively dependent on each other.
• Lift < 1: one item is a substitute for the other, which means one item has a negative effect on the other.
Interpretation

• Support measures overall impact
• Confidence shows the rate at which consequents will be found (useful in learning costs of promotion)
• Lift ratio shows how effective the rule is in finding consequents (useful if finding particular consequents is important)
Frequent Item Sets

• Ideally, we want to consider all possible combinations of items
• Problem: computation time grows exponentially as the number of items increases
• Solution: consider only “frequent item sets”
• Criterion for frequent: support
Types of rules

• Inexplicable – When a new Home Depot store opens in an area, the most commonly sold item is washing powder.
• Trivial – Croma customers who purchase a TV also purchase a disc connection.
• Useful – On Saturdays, Hypercity customers purchase diapers and beer together.
Interesting Association Rules
• Not all high-confidence rules are interesting
• The rule X → milk may have high confidence for many itemsets X, because milk is just purchased very often (independently of X), so the confidence will be high
Association Rule

Total number of transactions = 2,000. In the Venn diagram, 600 customers buy only tea, 1,000 buy only milk, and 200 buy both.

Support (Tea) = 800/2000 = 40%
Support (Milk) = 1200/2000 = 60%
Confidence (Tea → Milk) = 200/800 = 25%
Confidence (Milk → Tea) = 200/1200 ≈ 16.7%
Association Rule

Total number of transactions = 2,000. Here 10 customers buy only tea, 1,900 buy only milk, and 90 buy both.

Support (Tea) = 100/2000 = 5%
Support (Milk) = 1990/2000 ≈ 99.5%
Confidence (Tea → Milk) = 90/100 = 90%
Negative Itemsets

Find infrequent itemsets for negative association rules.
A ⇒ B does not imply ¬B ⇒ ¬A.
To find negative association rules, we need to find infrequent itemsets first.
A rule of the form A ⇒ ¬B means:
• supp(A ∪ ¬B) ≥ minsup
• supp(A ∪ B) = supp(A) − supp(A ∪ ¬B)
Only if both A and B are frequent will A ⇒ ¬B be considered.
Negative Rules

• People who buy one item do not buy another (A ⇒ ¬B)
• People who buy one item will not buy another
Negative Rules

            Beer | No Beer | Total
Diaper      5    | 4       | 9
No Diaper   6    | 3       | 9
Total       11   | 7       | 18
Negative Rules

Support(Beer, Diaper) = 5/18 ≈ 0.28

Conf(Beer → Diaper) = 5/11 ≈ 0.45

Correlation(Beer, Diaper) = n(Beer and Diaper) / (n(Beer) × n(Diaper)) = 5/(11 × 9) ≈ 0.05
Generating Frequent Item Sets
For k products…
1. User sets a minimum support criterion
2. Next, generate list of one-item sets that meet the
support criterion
3. Use the list of one-item sets to generate list of
two-item sets that meet the support criterion
4. Use list of two-item sets to generate list of three-item sets
5. Continue up through k-item sets
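A compact sketch of this level-wise generation (a simplified Apriori, assuming transactions are given as collections of items):

```python
# Level-wise frequent-itemset generation (simplified Apriori sketch).
def frequent_itemsets(transactions, min_count):
    transactions = [set(t) for t in transactions]
    items = {i for t in transactions for i in t}
    # Level 1: one-item sets that meet the support criterion.
    level = [frozenset([i]) for i in items
             if sum(i in t for t in transactions) >= min_count]
    result, k = list(level), 2
    while level:
        # Build candidate k-item sets from pairs of frequent (k-1)-item sets.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Keep only candidates that meet the support criterion.
        level = [c for c in candidates
                 if sum(c <= t for t in transactions) >= min_count]
        result += level
        k += 1
    return result
```

Running this on the baskets of the next slide with min_count = 3 reproduces the frequent itemsets listed there.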
Example: Frequent Itemsets

• Items = {milk, tea, pepsi, beer, juice}
• Support threshold = 3 baskets
B1 = {m, t, b}    B2 = {m, p, j}
B3 = {m, b}       B4 = {t, j}
B5 = {m, p, b}    B6 = {m, t, b, j}
B7 = {t, b, j}    B8 = {b, t}

• Frequent itemsets: {m}, {t}, {b}, {j}, {m,b}, {b,t}, {t,j}
Measures of Rule Performance

Interest of an association rule A → B: the difference between its confidence and the fraction of baskets that contain the consequent B:

Interest(A → B) = Confidence(A → B) − Support(B)
Example: Confidence and Interest

B1 = {m, c, b}    B2 = {m, p, j}
B3 = {m, b}       B4 = {c, j}
B5 = {m, p, b}    B6 = {m, c, b, j}
B7 = {c, b, j}    B8 = {b, c}

• Association rule: {m, b} → c
• Confidence = 2/4 = 0.5
• Interest = |0.5 − 5/8| = 1/8
• Item c appears in 5/8 of the baskets
• The rule is not very interesting!
Measures of Rule Performance

Leverage(A → B) = P(A and B) − P(A) × P(B)

Leverage = 0 when the two items are independent. It ranges from −1 (antecedent and consequent are antagonistic) to +1 (antecedent makes consequent more likely).
Conviction

Conviction is a measure that helps to judge whether the rule happened to be there by chance or not. It is an alternative to confidence:

Conviction(A → B) = (1 − Support(B)) / (1 − Confidence(A → B))
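A small sketch of leverage and conviction under the definitions above (the support helper and toy usage are assumptions for illustration):

```python
# Leverage and conviction from transaction data (sketch).
def support(itemset, transactions):
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)

def leverage(A, B, transactions):
    return (support(set(A) | set(B), transactions)
            - support(A, transactions) * support(B, transactions))

def conviction(A, B, transactions):
    conf = support(set(A) | set(B), transactions) / support(A, transactions)
    if conf == 1:
        return float("inf")        # the rule is never violated
    return (1 - support(B, transactions)) / (1 - conf)
```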
Cluster Analysis for
Segmentation
Session 5
• How can cluster analysis be adopted for segmentation?
• What are the different types of distance measures used for cluster analysis?
• What are the different ways of identifying the optimal number of clusters?
Segmentation for Marketing
Segmentation is a way of organizing customers into
groups with similar traits, product preferences, or
expectations. Once segments are identified,
marketing messages and in many cases even
products can be customized for each segment. The
better the segment(s) chosen for targeting by a
particular organization, the more successful the
organization is assumed to be in the marketplace.
Segments are constructed on the basis of customers’
• demographic characteristics
• psychographics
• desired benefits from products/services
• past-purchase and product-use behaviors
Cluster Analysis
• Cluster analysis is a class of statistical techniques that can be applied
to data that exhibit natural groupings.
• Cluster analysis makes no distinction between dependent and
independent variables. The entire set of interdependent
relationships is examined.
• Cluster analysis sorts through the raw data on customers and groups
them into clusters. A cluster is a group of relatively homogeneous
customers.
Basic Steps of Cluster Analysis

• Formulate the problem: select the variables you want to use as the basis for clustering.
• Compute the distance between customers along the selected variables.
• Apply the clustering procedure to the distance measures.
• Decide on the number of clusters.
• Map and interpret clusters, and draw conclusions; illustrative techniques like perceptual maps are useful.
Distance Measures
• Euclidean measure
• Jaccard similarity
• Cosine coefficient
• Dice coefficient
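Minimal sketches of the four measures (Euclidean and cosine on numeric vectors; Jaccard and Dice on sets):

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)

def jaccard(A, B):           # A, B are sets
    return len(A & B) / len(A | B)

def dice(A, B):
    return 2 * len(A & B) / (len(A) + len(B))
```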
K-Means Clustering Algorithm
•K-means clustering belongs to the
nonhierarchical class of clustering algorithms.
•It is one of the more popular algorithms used for
clustering in practice because of its simplicity
and speed.
K-Means Clustering Algorithm
• It is considered to be more robust to different types of
variables, is more appropriate for large datasets that are
common in marketing, and is less sensitive to some
customers who are outliers (in other words, extremely
different from others).
• For K-means clustering, the user has to specify the number
of clusters required before the clustering algorithm is
started.
Algorithm
1. Choose the number of clusters, k.
2. Generate k random points as cluster centroids.
3. Assign each point to the nearest cluster centroid.
4. Recompute the new cluster centroid.
5. Repeat the two previous steps until some convergence criterion is met.
Usually the convergence criterion is that the assignment of customers to
clusters has not changed over multiple iterations.
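A minimal NumPy sketch of these five steps (random initialization; convergence when assignments stop changing):

```python
import numpy as np

def kmeans(X, k, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # step 2
    labels = None
    while True:
        # Step 3: assign each point to the nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            return centroids, labels                            # step 5: converged
        labels = new_labels
        # Step 4: recompute each centroid as the mean of its points
        # (a fuller version would guard against empty clusters).
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
```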
• A cluster centroid is simply the average of all the points in that cluster. Its coordinates are the arithmetic mean for each dimension, taken separately over all the points in the cluster. Consider Joe, Sam, and Sara from the previous example, represented by their importance ratings on Premium Savings and Neighborhood Agent: Joe = {4,7}, Sam = {3,4}, Sara = {5,3}. If you assume that they belong to the same cluster, then the centroid for their cluster is obtained as:

• Cluster centroid Z = (z1, z2) = {(4+3+5)/3, (7+4+3)/3} = {4, 4.67}
Number of clusters
• The elbow criterion states that you should choose a number of clusters so
that adding another cluster does not add sufficient information. The elbow
is identified by plotting the ratio of the within cluster variance to between
cluster variance against the number of clusters. The within cluster variance
is an estimate of the average of the variance in the variables used as a basis
for segmentation (Importance Score ratings for Premium Savings and
Neighborhood Agent in the Geico example) among customers who belong
to a particular cluster. The between cluster variance is an estimate of the
variance of the segmentation basis variables between customers who
belong to different segments.
• The objective of cluster analysis (as mentioned
before) is to minimize the within cluster variance and
maximize the between cluster variance. Therefore, as
the number of clusters is increasing, the ratio of the
within cluster variance to the between cluster
variance will keep decreasing.
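The slides describe the within/between variance ratio; scikit-learn's inertia_ attribute (the total within-cluster sum of squares) falls with k in the same way, so plotting it against k reveals the elbow. A sketch:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

def elbow_plot(X, max_k=10):
    ks = range(1, max_k + 1)
    inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
                for k in ks]
    plt.plot(ks, inertias, marker="o")
    plt.xlabel("number of clusters k")
    plt.ylabel("within-cluster sum of squares")
    plt.show()
```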
• It should also be noted that the initial assignment of cluster seeds has a bearing on the final model performance. Some common methods for ensuring the stability of the results obtained from K-means clustering include:
• Running the algorithm multiple times with different starting values. When using random starting points, running the algorithm multiple times will ensure a different starting point each time.
• Splitting the data randomly into two halves and running the cluster analysis separately on each half. The results are robust and stable if the number of clusters and the size of different clusters are similar in both halves.
Profiling Clusters
• Once clusters are identified, the description of the clusters in terms of the variables
used for clustering— or using additional data such as demographics—helps to
customize marketing strategy for each segment. This process of describing the clusters
is called profiling. Figure 1 is an example of such a process. A good deal of cluster-
analysis software also provides information on which cluster a customer belongs to.
This information can be used to calculate the means of the profiling variables for each
cluster. In the Geico example, it is useful to investigate whether the segments also differ
with respect to demographic variables such as age and income. In Table 3, consider the
distribution of age and income for Segments A, B, and C as provided in Figure 1.
Applications of Cluster Analysis

• Understanding: group related documents for browsing, group genes and proteins that have similar functionality, or group stocks with similar price fluctuations.

Discovered clusters and their industry groups:
1. Applied-Matl-DOWN, Bay-Network-DOWN, 3-COM-DOWN, Cabletron-Sys-DOWN, CISCO-DOWN, HP-DOWN, DSC-Comm-DOWN, INTEL-DOWN, LSI-Logic-DOWN, Micron-Tech-DOWN, Texas-Inst-DOWN, Tellabs-Inc-DOWN, Natl-Semiconduct-DOWN, Oracl-DOWN, SGI-DOWN, Sun-DOWN → Technology1-DOWN
2. Apple-Comp-DOWN, Autodesk-DOWN, DEC-DOWN, ADV-Micro-Device-DOWN, Andrew-Corp-DOWN, Computer-Assoc-DOWN, Circuit-City-DOWN, Compaq-DOWN, EMC-Corp-DOWN, Gen-Inst-DOWN, Motorola-DOWN, Microsoft-DOWN, Scientific-Atl-DOWN → Technology2-DOWN
3. Fannie-Mae-DOWN, Fed-Home-Loan-DOWN, MBNA-Corp-DOWN, Morgan-Stanley-DOWN → Financial-DOWN
4. Baker-Hughes-UP, Dresser-Inds-UP, Halliburton-HLD-UP, Louisiana-Land-UP, Phillips-Petro-UP, Unocal-UP, Schlumberger-UP → Oil-UP

• Summarization: reduce the size of large data sets (e.g., clustering precipitation in Australia).
Clustering for Data Understanding and Applications

• Biology: taxonomy of living things: kingdom, phylum, class, order, family, genus and species
• Information retrieval: document clustering
• Land use: identification of areas of similar land use in an earth observation database
Clustering Applications
• Marketing: Help marketers discover distinct groups
in their customer bases, and then use this
knowledge to develop targeted marketing
programs
• City-planning: Identifying groups of houses
according to their house type, value, and
geographical location
• Earthquake studies: observed earthquake epicenters should be clustered along continent faults
Clustering Applications in Business

1. Identifying fake news
2. Spam filtering
3. Marketing and sales
4. Identifying fraudulent or criminal activity
5. Document analysis
6. Community detection
Types of Clusterings

• Partitional clustering: a division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset
• Hierarchical clustering: a set of nested clusters organized as a hierarchical tree
Partitional Clustering

[Figure: original points and a partitional clustering of them]

Hierarchical Clustering

[Figure: traditional hierarchical clustering of points p1–p4 with its dendrogram, and a non-traditional hierarchical clustering with its dendrogram]
Clustering Algorithms
• K-means and its variants
• Hierarchical clustering
Hard vs. soft clustering
• Hard clustering: Each item belongs to exactly one cluster
• More common and easier to do
• Soft clustering: An item can belong to more than one cluster.
• Makes more sense for applications like creating browsable
hierarchies
• You may want to put a pair of sneakers in two clusters: (i)
sports apparel and (ii) shoes
• You can only do that with a soft clustering approach.
Hierarchical Clustering

• Hierarchical clustering is of two types: agglomerative (bottom-up) and divisive (top-down)
• Commonly used linkage measures for hierarchical clustering techniques are:
  - Single linkage
  - Complete linkage
  - Average linkage

[Figures: agglomerative and divisive clustering]
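A short SciPy sketch of agglomerative clustering with the three linkage choices above (the points are toy data assumed for illustration):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1, 1], [1.5, 1], [5, 5], [5, 6], [9, 1]])   # toy points
for method in ("single", "complete", "average"):
    Z = linkage(X, method=method)                    # agglomerative merge tree
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut into two clusters
    print(method, labels)
```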
A simple example showing the implementation of the k-means algorithm (using k = 2)
Step 1:
Initialization: Randomly we choose following two centroids
(k=2) for two clusters.
In this case the 2 centroid are: m1=(1.0,1.0) and
m2=(5.0,7.0).
Step 2:
• Thus, we obtain two clusters
containing:
{1,2,3} and {4,5,6,7}.
• Their new centroids are:
Step 3:
• Now, using these centroids, we compute the Euclidean distance of each object, as shown in the table.
• Therefore, the new clusters are: {1,2} and {3,4,5,6,7}
• Next centroids are: m1 = (1.25, 1.5) and m2 = (3.9, 5.1)
• Step 4:
The clusters obtained are: {1,2} and {3,4,5,6,7}
• Therefore, there is no change in the clusters.
• Thus, the algorithm comes to a halt here, and the final result consists of 2 clusters: {1,2} and {3,4,5,6,7}.
[Figure: plot of the final clusters]
Limitations of K-means
• K-means has problems when clusters are of differing
  • sizes
  • densities
  • non-globular shapes
• K-means has problems when the data contains outliers.
Supervised Learning
Session 6
Supervised Learning

• Classification
• Prediction
• Generative
Supervised Learning: Important Concepts

• Data: labeled instances <xi, yi>, e.g. emails marked spam/not spam
• Training set
• Test set
Example: Spam Filter

Classification Examples
• OCR (input: images, classes: characters)
• Medical diagnosis (input: symptoms, classes: diseases)
• Automatic essay grader (input: document, classes: grades)
• Fraud detection (input: account activity, classes: fraud / no
fraud)
• Customer service email routing
• Recommended articles in a newspaper, recommended
books
• Financial investments
Illustrating Classification Task

Training set (learn a model):
Tid | Attrib1 | Attrib2 | Attrib3 | Class
1 | Yes | Large | 125K | No
2 | No | Medium | 100K | No
3 | No | Small | 70K | No
4 | Yes | Medium | 120K | No
5 | No | Large | 95K | Yes
6 | No | Medium | 60K | No
7 | Yes | Large | 220K | No
8 | No | Small | 85K | Yes
9 | No | Medium | 75K | No
10 | No | Small | 90K | Yes

Test set (apply the model):
Tid | Attrib1 | Attrib2 | Attrib3 | Class
11 | No | Small | 55K | ?
12 | Yes | Medium | 80K | ?
13 | Yes | Large | 110K | ?
14 | No | Small | 95K | ?
15 | No | Large | 67K | ?
Classification Process (1): Model Construction

Training data is fed to classification algorithms, which produce a classifier (model).

NAME | RANK | YEARS | TENURED
Mike | Assistant Prof | 3 | no
Mary | Assistant Prof | 7 | yes
Bill | Professor | 2 | yes
Jim | Associate Prof | 7 | yes
Dave | Assistant Prof | 6 | no
Anne | Associate Prof | 3 | no

Learned model: IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’
Classification Process (2): Use the Model in Prediction

The classifier is applied first to testing data, then to unseen data.

NAME | RANK | YEARS | TENURED
Tom | Assistant Prof | 2 | no
Merlisa | Associate Prof | 7 | no
George | Professor | 5 | yes
Joseph | Assistant Prof | 7 | yes

Unseen data: (Jeff, Professor, 4) → Tenured?
Classification Techniques

• Tree-based
• Distance-based
• Probability-based
• Neural network-based
Decision Tree
Working of Decision Tree
• Select the best attribute using an attribute selection measure (ASM) to split the records.
• Make that attribute a decision node and break the dataset into smaller subsets.
• Build the tree by repeating this process recursively for each child until one of these conditions matches (see the sketch below):
  • All the tuples belong to the same attribute value.
  • There are no more remaining attributes.
  • There are no more instances.
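A minimal scikit-learn sketch of fitting and inspecting such a tree; the toy features and labels loosely follow the earlier Attrib1/income table and are assumptions for illustration:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy data: [attrib1 (1=Yes, 0=No), income in K] -> class label.
X = [[1, 125], [0, 100], [0, 70], [1, 120], [0, 95], [0, 60]]
y = ["No", "No", "No", "No", "Yes", "No"]

# Gini index as the attribute selection measure; recursion depth capped.
tree = DecisionTreeClassifier(criterion="gini", max_depth=3).fit(X, y)
print(export_text(tree, feature_names=["attrib1", "income"]))
print(tree.predict([[0, 90]]))   # classify a new record
```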
Attribute Selection
•Information Gain (Entropy)
•Gain Ratio
•Gini index
Logistic Regression
Confusion Matrix
• A confusion matrix is a table that is often used to describe the performance of a classification model on a set of test data for which the true values are known.
• True Positive: we predicted positive and it’s true. In the pregnancy example, we predicted that a woman is pregnant and she actually is.
• True Negative: we predicted negative and it’s true. In the example, we predicted that a man is not pregnant and he actually is not.
• False Positive (Type 1 error): we predicted positive and it’s false. In the example, we predicted that a man is pregnant but he actually is not.
• False Negative (Type 2 error): we predicted negative and it’s false. In the example, we predicted that a woman is not pregnant but she actually is.
Classification Metric
• Accuracy simply measures how often the classifier predicts correctly. We can define accuracy as the ratio of the number of correct predictions to the total number of predictions.
• Precision explains how many of the positively predicted cases actually turned out to be positive. Precision is useful in cases where false positives are a higher concern than false negatives, e.g., in music or video recommendation systems and e-commerce websites, where wrong results could lead to customer churn and harm the business.
• Recall (sensitivity) explains how many of the actual
positive cases we were able to predict correctly with
our model. Recall is a useful metric in cases where False
Negative is of higher concern than False Positive. It is
important in medical cases where it doesn’t matter
whether we raise a false alarm but the actual positive
cases should not go undetected!
•F1 Score gives a combined idea about
Precision and Recall metrics. It is
maximum when Precision is equal to
Recall.
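The four metrics above follow directly from the confusion-matrix counts; a small sketch (the counts in the usage line are toy values):

```python
def classification_metrics(tp, tn, fp, fn):
    accuracy  = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall    = tp / (tp + fn)
    f1        = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

print(classification_metrics(tp=40, tn=45, fp=5, fn=10))
# (0.85, 0.888..., 0.8, 0.842...)
```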
When the F1 score is useful

• When FP and FN are equally costly
• When adding more data doesn’t effectively change the outcome
• When the number of true negatives is high
ROC Curve
• An ROC curve (receiver operating characteristic curve) is a graph showing the performance of a classification model at all classification thresholds. This curve plots two parameters: True Positive Rate and False Positive Rate.
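A short scikit-learn sketch of plotting an ROC curve (the labels and scores are toy values assumed for illustration):

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

y_true  = [0, 0, 1, 1, 0, 1, 1, 0]                    # toy ground truth
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.3]   # predicted probabilities

fpr, tpr, _ = roc_curve(y_true, y_score)   # one (FPR, TPR) point per threshold
plt.plot(fpr, tpr, marker=".")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title(f"ROC curve (AUC = {roc_auc_score(y_true, y_score):.2f})")
plt.show()
```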
https://teachablemachine.withgoogle.com/
