
Data Warehousing and Mining

Lecture 1

• Course syllabus
• Overview of data warehousing and mining

Lecture slides modified from:


– Jiawei Han (http://www-sal.cs.uiuc.edu/~hanj/DM_Book.html)
– Vipin Kumar (http://www-users.cs.umn.edu/~kumar/csci5980/index.html)
– Joseph Fong (http://www.cs.cityu.edu.hk/~jfong/course/cs5483/)
– Slobodan Vucetic (http://www.ist.temple.edu/~vucetic/cis526fall2004.htm)
Course Syllabus
Textbook:
1- J. Han, M. Kamber, Data Mining: Concepts and Techniques,
second edition, 2006.
2- Joseph Fong: Information Systems Reengineering and
Integration, 2006, ISBN 978-1-84628-382-6, Second edition.
Topics:
– Overview of data warehousing and mining
– Data warehouse and OLAP technology
– Data preprocessing
– Mining association rules
– Classification and prediction
– Cluster analysis
– Mining complex types of data
Grading:
– (10%) Homework Assignments and Quizzes
– (30%) Project
– (60%) Final Exam.
Course Syllabus
Late Policy and Academic Honesty:
The projects and homework assignments are due in class, on the specified due
date. NO LATE SUBMISSIONS will be accepted. For fairness, this policy will be
strictly enforced.
Academic honesty is taken seriously. You must write up your own solutions and
code. For homework problems or projects you are allowed to discuss the
problems or assignments verbally with other class members. You MUST
acknowledge the people with whom you discussed your work. Any other sources
(e.g. Internet, research papers, books) used for solutions and code MUST also
be acknowledged. In case of doubt PLEASE contact the instructor.

Disability Disclosure Statement


Any student who has a need for accommodation based on the impact of a disability
should contact me privately to discuss the specific situation as soon as possible.
Contact Disability Resources and Services at 215-204-1280 in 100 Ritter Annex
to coordinate reasonable accommodations for students with documented
disabilities.

Motivation:
“Necessity is the Mother of Invention”

• Data explosion problem

– Automated data collection tools and mature database technology
lead to tremendous amounts of data stored in databases, data
warehouses and other information repositories

• We are drowning in data, but starving for knowledge!

• Solution: Data warehousing and data mining

– Data warehousing and on-line analytical processing

– Extraction of interesting knowledge (rules, regularities, patterns,
constraints) from data in large databases
Why Mine Data? Commercial Viewpoint

• Lots of data is being collected and warehoused
– Web data, e-commerce
– purchases at department/grocery stores
– Bank/Credit Card transactions

• Computers have become cheaper and more powerful


• Competitive Pressure is Strong
– Provide better, customized services for an edge (e.g. in
Customer Relationship Management)
Why Mine Data? Scientific Viewpoint

• Data collected and stored at


enormous speeds (GB/hour)
– remote sensors on a satellite
– telescopes scanning the skies
– microarrays generating gene
expression data
– scientific simulations
generating terabytes of data
• Traditional techniques infeasible for raw
data
• Data mining may help scientists
– in classifying and segmenting data
– in Hypothesis Formation

What Is Data Mining?

• Data mining (knowledge discovery in databases):


– Extraction of interesting (non-trivial, implicit, previously
unknown and potentially useful) information or patterns
from data in large databases

• Alternative names and their “inside stories”:


– Data mining: a misnomer?
– Knowledge discovery(mining) in databases (KDD),
knowledge extraction, data/pattern analysis, data
archeology, business intelligence, etc.

Examples: What is (not) Data Mining?

What is not Data Mining?

– Look up a phone number in a phone directory
– Query a Web search engine for information about “Amazon”

What is Data Mining?

– Certain names are more prevalent in certain US locations
(O’Brien, O’Rourke, O’Reilly… in the Boston area)
– Group together similar documents returned by a search engine
according to their context (e.g. Amazon rainforest, Amazon.com)
Data Mining: Classification Schemes

• Decisions in data mining


– Kinds of databases to be mined
– Kinds of knowledge to be discovered
– Kinds of techniques utilized
– Kinds of applications adapted

• Data mining tasks


– Descriptive data mining
– Predictive data mining
Decisions in Data Mining

• Databases to be mined
– Relational, transactional, object-oriented, object-relational,
active, spatial, time-series, text, multi-media, heterogeneous,
legacy, WWW, etc.
• Knowledge to be mined
– Characterization, discrimination, association, classification,
clustering, trend, deviation and outlier analysis, etc.
– Multiple/integrated functions and mining at multiple levels
• Techniques utilized
– Database-oriented, data warehouse (OLAP), machine learning,
statistics, visualization, neural network, etc.
• Applications adapted
– Retail, telecommunication, banking, fraud analysis, DNA mining, stock
market analysis, Web mining, Weblog analysis, etc.
Data Mining Tasks

• Prediction Tasks
– Use some variables to predict unknown or future values of other
variables
• Description Tasks
– Find human-interpretable patterns that describe the data.

Common data mining tasks


– Classification [Predictive]
– Clustering [Descriptive]
– Association Rule Discovery [Descriptive]
– Sequential Pattern Discovery [Descriptive]
– Regression [Predictive]
– Deviation Detection [Predictive]
Classification: Definition

• Given a collection of records (training set)


– Each record contains a set of attributes, one of the attributes is
the class.
• Find a model for class attribute as a function of
the values of other attributes.
• Goal: previously unseen records should be
assigned a class as accurately as possible.
– A test set is used to determine the accuracy of the model.
Usually, the given data set is divided into training and test sets,
with training set used to build the model and test set used to
validate it.

Classification Example

Training set:

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Test set (class to be assigned):

Refund  Marital Status  Taxable Income  Cheat
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Single          40K             ?
No      Married         80K             ?

Learn a model (classifier) from the training set, then apply it to the test set.
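As a quick illustration of the workflow above, here is a minimal sketch that
learns a classifier from the training set and applies it to the test set. It
assumes scikit-learn and pandas (the slides do not prescribe any library) and
simplifies Marital Status to a binary "married" flag.

# Learn a model on the training records, then classify unseen test records.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

train = pd.DataFrame({
    "refund":  [1, 0, 0, 1, 0, 0, 1, 0, 0, 0],
    "married": [0, 1, 0, 1, 0, 1, 0, 0, 1, 0],   # simplified encoding
    "income":  [125, 100, 70, 120, 95, 60, 220, 85, 75, 90],
    "cheat":   ["no", "no", "no", "no", "yes", "no", "no", "yes", "no", "yes"],
})
model = DecisionTreeClassifier().fit(train[["refund", "married", "income"]],
                                     train["cheat"])

test = pd.DataFrame({"refund": [0, 1, 0], "married": [0, 1, 1],
                     "income": [75, 50, 150]})
print(model.predict(test))   # one predicted class per unseen record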
Classification: Application 1

• Direct Marketing
– Goal: Reduce cost of mailing by targeting a set of
consumers likely to buy a new cell-phone product.
– Approach:
• Use the data for a similar product introduced before.
• We know which customers decided to buy and which decided
otherwise. This {buy, don’t buy} decision forms the class
attribute.
• Collect various demographic, lifestyle, and company-
interaction related information about all such customers.
– Type of business, where they stay, how much they earn, etc.
• Use this information as input attributes to learn a classifier
model.

Classification: Application 2

• Fraud Detection
– Goal: Predict fraudulent cases in credit card
transactions.
– Approach:
• Use credit card transactions and the information on its
account-holder as attributes.
– When does a customer buy, what does he buy, how often he
pays on time, etc
• Label past transactions as fraud or fair transactions. This
forms the class attribute.
• Learn a model for the class of the transactions.
• Use this model to detect fraud by observing credit card
transactions on an account.

Classification: Application 3

• Customer Attrition/Churn:
– Goal: To predict whether a customer is likely to be
lost to a competitor.
– Approach:
• Use detailed record of transactions with each of the past and
present customers, to find attributes.
– How often the customer calls, where he calls, what time-of-the
day he calls most, his financial status, marital status, etc.
• Label the customers as loyal or disloyal.
• Find a model for loyalty.

Classification: Application 4

• Sky Survey Cataloging


– Goal: To predict class (star or galaxy) of sky objects,
especially visually faint ones, based on the telescopic
survey images (from Palomar Observatory).
– 3000 images with 23,040 x 23,040 pixels per image.
– Approach:
• Segment the image.
• Measure image attributes (features) - 40 of them per object.
• Model the class based on these features.
• Success Story: Could find 16 new high red-shift quasars,
some of the farthest objects that are difficult to find!

Classifying Galaxies

Class:
• Stages of Formation: Early, Intermediate, Late

Attributes:
• Image features
• Characteristics of light waves received, etc.

Data Size:
• 72 million stars, 20 million galaxies
• Object Catalog: 9 GB
• Image Database: 150 GB
Clustering Definition

• Given a set of data points, each having a set of


attributes, and a similarity measure among them,
find clusters such that
– Data points in one cluster are more similar to one
another.
– Data points in separate clusters are less similar to
one another.
• Similarity Measures:
– Euclidean Distance if attributes are continuous.
– Other Problem-specific Measures.

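A minimal sketch of the idea, assuming scikit-learn's k-means (one possible
Euclidean-distance method; the 3-D points below are made up):

# Cluster points so that intracluster distances are small.
import numpy as np
from sklearn.cluster import KMeans

points = np.array([[1.0, 1.2, 0.9], [0.8, 1.1, 1.0],    # one dense group
                   [5.0, 5.2, 4.9], [5.1, 4.8, 5.0]])   # another dense group
labels = KMeans(n_clusters=2, n_init=10).fit_predict(points)
print(labels)   # points in the same cluster share a label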
Illustrating Clustering

Euclidean Distance Based Clustering in 3-D space.

Intracluster distances are minimized; intercluster distances are maximized.
Clustering: Application 1

• Market Segmentation:
– Goal: subdivide a market into distinct subsets of
customers where any subset may conceivably be
selected as a market target to be reached with a
distinct marketing mix.
– Approach:
• Collect different attributes of customers based on their
geographical and lifestyle related information.
• Find clusters of similar customers.
• Measure the clustering quality by observing buying patterns
of customers in same cluster vs. those from different clusters.

Clustering: Application 2

• Document Clustering:
– Goal: To find groups of documents that are similar to
each other based on the important terms appearing in
them.
– Approach: To identify frequently occurring terms in
each document. Form a similarity measure based on
the frequencies of different terms. Use it to cluster.
– Gain: Information Retrieval can utilize the clusters to
relate a new document or search term to clustered
documents.

Association Rule Discovery: Definition

• Given a set of records each of which contain some


number of items from a given collection;
– Produce dependency rules which will predict occurrence of an
item based on occurrences of other items.

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
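The rule {Diaper, Milk} --> {Beer} can be checked against the table above by
counting directly; a small sketch in plain Python (no library assumed):

# support = fraction of records containing antecedent and consequent;
# confidence = that count divided by records containing the antecedent.
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]
antecedent, consequent = {"Diaper", "Milk"}, {"Beer"}
n_ante = sum(antecedent <= t for t in transactions)
n_both = sum((antecedent | consequent) <= t for t in transactions)
print(f"support={n_both/len(transactions):.2f}, confidence={n_both/n_ante:.2f}")
# support=0.40, confidence=0.67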
Association Rule Discovery: Application 1

• Marketing and Sales Promotion:


– Let the rule discovered be
{Bagels, … } --> {Potato Chips}
– Potato Chips as consequent => Can be used to
determine what should be done to boost its sales.
– Bagels in the antecedent => Can be used to see which
products would be affected if the store discontinues
selling bagels.
– Bagels in antecedent and Potato chips in consequent
=> Can be used to see what products should be sold
with Bagels to promote sale of Potato chips!
Association Rule Discovery: Application 2

• Supermarket shelf management.


– Goal: To identify items that are bought together by
sufficiently many customers.
– Approach: Process the point-of-sale data collected
with barcode scanners to find dependencies among
items.
– A classic rule:
• If a customer buys diapers and milk, then he is very likely to
buy beer.

Regression

• Predict a value of a given continuous valued variable


based on the values of other variables, assuming a
linear or nonlinear model of dependency.
• Greatly studied in statistics, neural network fields.
• Examples:
– Predicting sales amounts of new product based on advertising
expenditure.
– Predicting wind velocities as a function of temperature, humidity,
air pressure, etc.
– Time series prediction of stock market indices.

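A minimal sketch of the first example, fitting a linear model with numpy
(the expenditure/sales numbers are invented for illustration):

import numpy as np

advertising = np.array([1.0, 2.0, 3.0, 4.0])    # advertising expenditure
sales       = np.array([2.1, 3.9, 6.2, 8.1])    # observed sales amounts
slope, intercept = np.polyfit(advertising, sales, deg=1)   # least squares
print(slope * 5.0 + intercept)   # predicted sales at expenditure 5.0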
Deviation/Anomaly Detection

• Detect significant deviations from normal behavior
• Applications:
– Credit Card Fraud Detection

– Network Intrusion
Detection

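One simple way to flag "significant deviations" is a z-score rule; a sketch
with an assumed threshold of 2 and made-up transaction amounts:

import numpy as np

amounts = np.array([20, 35, 18, 25, 30, 22, 900])    # one suspicious value
z = (amounts - amounts.mean()) / amounts.std()
print(amounts[np.abs(z) > 2])    # flags the 900 as a deviation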
Data Mining and Induction Principle

Induction vs Deduction

• Deductive reasoning is truth-preserving:


1. All horses are mammals
2. All mammals have lungs
3. Therefore, all horses have lungs

• Inductive reasoning adds information:


1. All horses observed so far have lungs.
2. Therefore, all horses have lungs.

The Problems with Induction

From true facts, we may induce false models.

Prototypical example:
– European swans are all white.
– Induce: "Swans are white" as a general rule.
– Discover Australia and black swans...
– Problem: the set of examples is not random and representative.

Another example: distinguish US tanks from Iraqi tanks


– Method: database of pictures split into a train set and a test set;
classification model built on the train set
– Result: good predictive accuracy on the test set; bad score on
independent pictures
– Why did it go wrong: other distinguishing features in the pictures
(hangar versus desert)
Hypothesis-Based vs. Exploratory-Based

• The hypothesis-based method:


– Formulate a hypothesis of interest.
– Design an experiment that will yield data to test this hypothesis.
– Accept or reject hypothesis depending on the outcome.

• Exploratory-based method:
– Try to make sense of a bunch of data without an a priori
hypothesis!
– The only prevention against false results is significance:
• ensure statistical significance (using train and test etc.)
• ensure domain significance (i.e., make sure that the results make
sense to a domain expert)

Data Mining: A KDD Process

– Data mining: the core of the knowledge discovery process.

Databases → Data Cleaning / Data Integration → Data Warehouse →
Data Selection (Task-relevant Data) → Data Mining → Pattern Evaluation
Steps of a KDD Process

• Learning the application domain:


– relevant prior knowledge and goals of application
• Creating a target data set: data selection
• Data cleaning and preprocessing: (may take 60% of effort!)
• Data reduction and transformation:
– Find useful features, dimensionality/variable reduction, invariant
representation.
• Choosing functions of data mining
– summarization, classification, regression, association, clustering.
• Choosing the mining algorithm(s)
• Data mining: search for patterns of interest
• Pattern evaluation and knowledge presentation
– visualization, transformation, removing redundant patterns, etc.
• Use of discovered knowledge
Data Mining and Business Intelligence
Increasing potential to support business decisions (bottom to top), with the
typical user at each layer:
– Making Decisions (End User)
– Data Presentation: Visualization Techniques (Business Analyst)
– Data Mining: Information Discovery (Data Analyst)
– Data Exploration: Statistical Analysis, Querying and Reporting (Analyst)
– Data Warehouses / Data Marts: OLAP, MDA (DBA)
– Data Sources: Paper, Files, Information Providers, Database Systems, OLTP
Data Mining: On What Kind of Data?

• Relational databases
• Data warehouses
• Transactional databases
• Advanced DB and information repositories
– Object-oriented and object-relational databases
– Spatial databases
– Time-series data and temporal data
– Text databases and multimedia databases
– Heterogeneous and legacy databases
– WWW
Data Mining: Confluence of Multiple
Disciplines
Data mining draws on: Database Technology, Statistics, Machine Learning,
Visualization, Information Science, and other disciplines.
Data Mining vs. Statistical Analysis

Statistical Analysis:
• Ill-suited for Nominal and Structured Data Types
• Completely data driven - incorporation of domain knowledge not
possible
• Interpretation of results is difficult and daunting
• Requires expert user guidance

Data Mining:
• Large Data sets
• Efficiency of Algorithms is important
• Scalability of Algorithms is important
• Real World Data
• Lots of Missing Values
• Pre-existing data - not user generated
• Data not static - prone to updates
• Efficient methods for data retrieval available for use
Data Mining vs. DBMS

• Example DBMS Reports


– Last month's sales for each service type
– Sales per service grouped by customer sex or age
bracket
– List of customers who lapsed their policy

• Questions answered using Data Mining


– What characteristics do customers that lapse their
policy have in common and how do they differ from
customers who renew their policy?
– Which motor insurance policy holders would be
potential customers for my House Content Insurance
policy?
Data Mining and Data Warehousing

• Data Warehouse: a centralized data repository which can be queried for
business benefit.
• Data Warehousing makes it possible to
– extract archived operational data
– overcome inconsistencies between different legacy data formats
– integrate data throughout an enterprise, regardless of location,
format, or communication requirements
– incorporate additional or expert information
• OLAP: On-line Analytical Processing
• Multi-Dimensional Data Model (Data Cube)
• Operations:
– Roll-up
– Drill-down
– Slice and dice
– Rotate
An OLAM Architecture
(Figure: a four-layer OLAM architecture)
– Layer 4: User Interface (mining query in, mining result out; GUI API)
– Layer 3: OLAM Engine and OLAP Engine (OLAP/OLAM)
– Layer 2: Multidimensional database (MDDB) with metadata, reached through
a Data Cube API
– Layer 1: Databases and Data Warehouse, with data cleaning, data
integration, and filtering through a Database API
DBMS, OLAP, and Data Mining

Task:
– DBMS: extraction of detailed and summary data
– OLAP: summaries, trends and forecasts
– Data Mining: knowledge discovery of hidden patterns and insights

Type of result:
– DBMS: information
– OLAP: analysis
– Data Mining: insight and prediction

Method:
– DBMS: deduction (ask the question, verify with data)
– OLAP: multidimensional data modeling, aggregation, statistics
– Data Mining: induction (build the model, apply it to new data, get the result)

Example question:
– DBMS: Who purchased mutual funds in the last 3 years?
– OLAP: What is the average income of mutual fund buyers by region by year?
– Data Mining: Who will buy a mutual fund in the next 6 months, and why?
Example of DBMS, OLAP and Data
Mining: Weather Data

DBMS:
Day outlook temperature humidity windy play

1 sunny 85 85 false no
2 sunny 80 90 true no
3 overcast 83 86 false yes
4 rainy 70 96 false yes
5 rainy 68 80 false yes
6 rainy 65 70 true no
7 overcast 64 65 true yes
8 sunny 72 95 false no
9 sunny 69 70 false yes
10 rainy 75 80 false yes
11 sunny 75 70 true yes
12 overcast 72 90 true yes
13 overcast 81 75 false yes
14 rainy 71 91 true no
Example of DBMS, OLAP and Data
Mining: Weather Data

• By querying a DBMS containing the above table we may answer questions like:
• What was the temperature on the sunny days? {85, 80, 72, 69, 75}
• On which days was the humidity less than 75? {6, 7, 9, 11}
• On which days was the temperature greater than 70? {1, 2, 3, 8, 10, 11,
12, 13, 14}
• On which days was the temperature greater than 70 and the humidity less
than 75? The intersection of the above two: {11}

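The same queries can be expressed as plain filters over the table; a sketch
in Python (rows are day, outlook, temperature, humidity, windy, play):

rows = [
    (1, "sunny", 85, 85, False, "no"),      (2, "sunny", 80, 90, True, "no"),
    (3, "overcast", 83, 86, False, "yes"),  (4, "rainy", 70, 96, False, "yes"),
    (5, "rainy", 68, 80, False, "yes"),     (6, "rainy", 65, 70, True, "no"),
    (7, "overcast", 64, 65, True, "yes"),   (8, "sunny", 72, 95, False, "no"),
    (9, "sunny", 69, 70, False, "yes"),     (10, "rainy", 75, 80, False, "yes"),
    (11, "sunny", 75, 70, True, "yes"),     (12, "overcast", 72, 90, True, "yes"),
    (13, "overcast", 81, 75, False, "yes"), (14, "rainy", 71, 91, True, "no"),
]
hot = {d for d, o, t, h, w, p in rows if t > 70}   # {1,2,3,8,10,11,12,13,14}
dry = {d for d, o, t, h, w, p in rows if h < 75}   # {6,7,9,11}
print(sorted(hot & dry))                           # [11]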
Example of DBMS, OLAP and Data
Mining: Weather Data
OLAP:
• Using OLAP we can create a Multidimensional Model of our data
(Data Cube).
• For example using the dimensions: time, outlook and play we can
create the following model.

play yes/no   sunny   rainy   overcast
Week 1        0/2     2/1     2/0
Week 2        2/1     1/1     2/0
(overall: 9 yes / 5 no)

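A sketch of how this cube could be built with pandas (an assumed tool; days
1-7 form week 1 and days 8-14 week 2, as in the slide):

import pandas as pd

df = pd.DataFrame({
    "day":     range(1, 15),
    "outlook": ["sunny", "sunny", "overcast", "rainy", "rainy", "rainy",
                "overcast", "sunny", "sunny", "rainy", "sunny", "overcast",
                "overcast", "rainy"],
    "play":    ["no", "no", "yes", "yes", "yes", "no", "yes",
                "no", "yes", "yes", "yes", "yes", "yes", "no"],
})
df["week"] = (df["day"] - 1) // 7 + 1    # roll up days into weeks
cube = df.pivot_table(index="week", columns="outlook", values="play",
                      aggfunc=lambda s: f"{(s == 'yes').sum()}/{(s == 'no').sum()}")
print(cube)    # matches the week-by-outlook table above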
Example of DBMS, OLAP and Data
Mining: Weather Data
Data Mining:

• Using the ID3 algorithm we can produce the following


decision tree:

• outlook = sunny
– humidity = high: no
– humidity = normal: yes
• outlook = overcast: yes
• outlook = rainy
– windy = true: no
– windy = false: yes

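A comparable tree can be grown with scikit-learn; a sketch using its CART
algorithm with an entropy criterion (not ID3 itself, so splits may differ):

import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

df = pd.DataFrame({
    "outlook":  ["sunny", "sunny", "overcast", "rainy", "rainy", "rainy",
                 "overcast", "sunny", "sunny", "rainy", "sunny", "overcast",
                 "overcast", "rainy"],
    "humidity": [85, 90, 86, 96, 80, 70, 65, 95, 70, 80, 70, 90, 75, 91],
    "windy":    [0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1],
    "play":     ["no", "no", "yes", "yes", "yes", "no", "yes", "no", "yes",
                 "yes", "yes", "yes", "yes", "no"],
})
X = pd.get_dummies(df[["outlook", "humidity", "windy"]])   # one-hot outlook
tree = DecisionTreeClassifier(criterion="entropy").fit(X, df["play"])
print(export_text(tree, feature_names=list(X.columns)))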
Major Issues in Data Warehousing and
Mining
• Mining methodology and user interaction
– Mining different kinds of knowledge in databases
– Interactive mining of knowledge at multiple levels of abstraction
– Incorporation of background knowledge
– Data mining query languages and ad-hoc data mining
– Expression and visualization of data mining results
– Handling noise and incomplete data
– Pattern evaluation: the interestingness problem
• Performance and scalability
– Efficiency and scalability of data mining algorithms
– Parallel, distributed and incremental mining methods
Major Issues in Data Warehousing and
Mining
• Issues relating to the diversity of data types
– Handling relational and complex types of data
– Mining information from heterogeneous databases and global
information systems (WWW)
• Issues related to applications and social impacts
– Application of discovered knowledge
• Domain-specific data mining tools
• Intelligent query answering
• Process control and decision making
– Integration of the discovered knowledge with existing knowledge:
A knowledge fusion problem
– Protection of data security, integrity, and privacy

Multi-Tiered Architecture of data warehousing and data mining

(Figure: multi-tiered architecture)
– Data Sources: operational DBs and other sources; Extract / Transform /
Load / Refresh
– Data Storage: Data Warehouse and Data Marts, with a Metadata repository
and a Monitor & Integrator
– OLAP Engine: OLAP Server
– Front-End Tools: Analysis, Query, Reports, Data mining
What is XML?

•EXtensible Markup Language (XML)


•World Wide Web Consortium (W3C) recommendation Version 1.0 as of
10/02/1998.
•Describes data, rather than instructing a system on how to process it.
•Provides powerful capabilities for data integration and data-driven
styling.
•Introduces new processing paradigms and requires new ways of thinking
about Web development.
•A Meta-Markup Language, a set of rules for creating semantic tags used
to describe data.

XML is Extensible
• The tags used to markup HTML documents and
the structure of HTML documents are
predefined.

• The author of HTML documents can only use


tags that are defined in the HTML standard.

• XML allows the author to define his own tags


and his own document structure.
Benefits of using XML
• It is structured.
• Documents are easily committed to a persistence
layer.
• Platform independent, textual information.
• An open standard.
• Language independent.
• DOM and SAX are open, language-independent
set of interfaces.
• It is Web enabled.
Typical XML System
(Figure: an XML Document (content) and an XML DTD (rules) feed an XML
Parser (processor), whose output drives an XML Application.)
•XML Document (content)
•XML Document Type Definition - DTD (structure definition; this is an
operational part)
•XML Parser (conformity checker)
•XML Application (uses the output of the Parser to achieve your unique
objectives)
How can XML be used?
• XML can keep data separated from your
HTML document.
• XML can also store data inside HTML
documents (Data Islands).
• XML can be used to exchange data.
• XML can be used to store data.

XML Syntax
• An example XML document.

<?xml version="1.0"?>
<note>
<to>Tan Siew Teng</to>
<from>Lee Sim Wee</from>
<heading>Reminder</heading>
<body>Don't forget the Golf Championship this
weekend!</body>
</note>
Example (cont’d)
• The first line in the document: The XML
declaration should always be included.
• It defines the XML version of the document.
• In this case the document conforms to the 1.0
specification of XML.
<?xml version="1.0"?>
• The next line defines the first element of the
document (the root element):
<note>

Example (cont’d)
• The next lines define 4 child elements of the root
(to, from, heading, and body):

<to>Tan Siew Teng</to>


<from>Lee Sim Wee</from>
<heading>Reminder</heading>
<body>Don't forget the Golf Championship this
weekend!</body>

• The last line defines the end of the root element:

</note>
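Such a document can be read with Python's standard library; a minimal
sketch (xml.etree.ElementTree is one common choice, not mandated here):

import xml.etree.ElementTree as ET

doc = """<?xml version="1.0"?>
<note>
<to>Tan Siew Teng</to>
<from>Lee Sim Wee</from>
<heading>Reminder</heading>
<body>Don't forget the Golf Championship this weekend!</body>
</note>"""
root = ET.fromstring(doc)                 # parse the document
print(root.tag, root.find("to").text)     # note Tan Siew Teng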
What is an XML element?
• An XML element is made up of a start tag, an
end tag, and data in between.
<Sport>Golf</Sport>
• The name of the element is enclosed by the less
than and greater than characters, and these are
called tags.
• The start and end tags describe the data within the
tags, which is considered the value of the
element.
• For example, the following XML element is a
<player> element with the value "Tiger Woods":
<player>Tiger Woods</player>
There are 3 types of tags
• Start-Tag
– In the example <Sport> is the start tag. It defines
type of the element and possible attribute
specifications
<Player firstname="Tiger" lastname="Woods">
• End-Tag
– In the example </Sport> is the end tag. It
identifies the type of element that the tag is ending.
Unlike the start tag, the end tag cannot contain
attribute specifications.
• Empty Element Tag
– Like start tag this has attribute specifications but it
does not need an end tag. Denotes that element is
empty (does not have any other elements). Note
the symbol for the ending tag '/' before '>':
<Player firstname="Tiger" lastname="Woods"/>
XML elements must have a closing
tag
In HTML some elements do not have to have a closing tag.

The following code is legal in HTML:


<p>This is a paragraph
<p>This is another paragraph

In XML all elements must have a closing tag like this:


<p>This is a paragraph</p>
<p>This is another paragraph</p>

Rules for Naming Elements
• XML names should start with a letter or the
underscore character.
• Rest of the name can contain letters, digits,
dots, underscores or hyphens.
• No spaces in names are allowed.
• Names cannot start with 'xml' which is a
reserved word.

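A sketch that checks these rules with a simplified regular expression (real
XML names also allow a wider Unicode range, so this is an approximation):

import re

def is_valid_name(name):
    if name.lower().startswith("xml"):    # 'xml' prefix is reserved
        return False
    # start with a letter or underscore; then letters, digits, . _ -
    return re.fullmatch(r"[A-Za-z_][A-Za-z0-9._-]*", name) is not None

print(is_valid_name("player"), is_valid_name("xml-tag"), is_valid_name("1st"))
# True False False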
XML tags are case sensitive
• XML tags are case sensitive. The tag <Message> is
different from the tag <message>.

• Opening and closing tags must therefore be


written with the same case:

<message>This is correct</message>

<Message>This is incorrect</message>

All XML elements must be
properly nested
In HTML some elements can be improperly nested within each other
like this:

<b><i>This text is bold and italic</b></i>

In XML all elements must be properly nested within each other like this

<b><i>This text is bold and italic</i></b>

XML documents must have a root tag
• Documents must contain a single tag pair to
define the root element.
• All other elements must be nested within the root
element.
• All elements can have sub (children) elements.
• Sub elements must be in pairs and correctly
nested within their parent element:
<root>
<child>
<subchild>
</subchild>
</child>
</root>
XML Attributes
• XML attributes are normally used to describe
XML elements, or to provide additional
information about elements.
• An element can optionally contain one or more
attributes. An attribute is a name-value pair
separated by an equal sign (=).
• Usually, or most common, attributes are used to
provide information that is not a part of the
content of the XML document.
• Often the attribute data is more important to the
XML parser than to the reader.
XML Attributes (cont’d)
• Attributes are always contained within the start
tag of an element. Here are some examples:
<Player firstname="Tiger" lastname="Woods" />
Player - Element Name
firstname - Attribute Name
Tiger - Attribute Value
• HTML examples:
<img src="computer.gif">
<a href="demo.asp">
• XML examples:
<file type="gif">
<person id="3344">
Attribute values must always be
quoted
• XML elements can have attributes in name/value
pairs just like in HTML.
• An element can optionally contain one or more
attributes.
• In XML the attribute value must always be
quoted.
• An attribute is a name-value pair separated by an
equal sign (=).
<CITY ZIP="01085">Westfield</CITY>
• ZIP="01085" is an attribute of the <CITY> element.
What is a Comment ?
• Comments are informational help for the
reader.

• These are ignored by XML processors.

• They are enclosed within "<!--" and "-->"


tags.
<!-- This is a comment -->
What is a Processing Instruction ?
• Processing Instructions provide a way to send
instructions to computer programs or
applications. They are enclosed within "<?" and
"?>" tags.

<?xml:stylesheet type="text/xsl" href="styler.xsl"?>

xml:stylesheet - Application name


type="text/xsl" href="styler.xsl" - Instructions to the
application

What is a DTD ?
• Document Type Definition (DTD) is a
mechanism (set of rules) to describe the
structure, syntax and vocabulary of XML
documents.

• It is a modeling language for XML but it


does not follow the same syntax as XML.

Document Type Definition (DTD)
• Define the legal building blocks of an XML
document.
• Set of rules to define document structure with a
list of legal elements.
• Declared inline in the XML document or as an
external reference.
• All names are user defined.
• Derived from SGML.
• One DTD can be used for multiple documents.
• Has ASCII format.
• DOCTYPE keyword.
Element Declaration
• Following lines show the possible syntaxes for
element declaration
<!ELEMENT reports (employee*)>
<!ELEMENT employee (ss_number, first_name,
middle_name, last_name, email, extension, birthdate,
salary)>
<!ELEMENT email (#PCDATA)>
<!ELEMENT extension EMPTY>
#PCDATA - Parsed Character Data, meaning that
elements can contain text. This requirement means that
no child elements may be present in the element within
which #PCDATA is specified
EMPTY - Indicates that this is the leaf element, cannot
contain any more children
Occurrence
• There are notations to specify the number of
occurrences that a child element can occur within
the parent element.
• These notations can appear at the end of each
child element
+ - Element can appear one or several times
* - Element can appear zero or several times
? - Element can appear zero or one time
nothing - Element can appear only once (also, it
must appear once)

Separators
, - Elements on the right and left of
comma must appear in the same
order.

| - Only one of the elements on the


left or right of this symbol must
appear.
Attribute Declaration
Syntaxes for attribute declaration
<!ATTLIST customer ID CDATA #REQUIRED>
<!ATTLIST customer Preferred (true | false) "false">
Customer - Element name
ID - Attribute type ID uniquely identifies an element
IDREF - Attribute with type IDREF point to elements with an
ID attribute
Preferred - Attribute names
(true | false) - Possible attribute values
False - Default attribute value
CDATA - Character data
#REQUIRED- Attribute value must be provided
#IMPLIED - If no value is provided, application must use its
own default
#FIXED - Attribute value must be the one that is
provided in DTD
NMTOKEN - Name token; consists of letters, digits, periods, underscores,
hyphens and colon characters
Why use a DTD?
• XML provides an application independent way of
sharing data.
• With a DTD, independent groups of people can
agree to use a common DTD for interchanging
data.
• Your application can use a standard DTD to
verify that data that you receive from the outside
world is valid.
• You can also use a DTD to verify your own data.
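A sketch of such a validity check using lxml (an assumed third-party
library; Python's standard library parses XML but does not validate DTDs):

from io import StringIO
from lxml import etree

dtd = etree.DTD(StringIO("<!ELEMENT note (to, body)>"
                         "<!ELEMENT to (#PCDATA)>"
                         "<!ELEMENT body (#PCDATA)>"))
doc = etree.fromstring("<note><to>Tan</to><body>Hi</body></note>")
print(dtd.validate(doc))    # True if the document obeys the DTD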
Well-formed Document

<?xml version="1.0"?>
<TITLE>
<title>A Well-formed Document</title>
<first>
This is a simple
<bold>well-formed</bold>
document.
</first>
</TITLE>
Rules for Well-formed
Documents
• The first line of a well-formed XML document
must be an XML declaration.
• All non-empty elements must have start tags and
end tags with matching element names.
• All empty elements must end with />.
• All document must contain one root element.
• Nested elements must be completely nested
within their higher-level elements.
• The only reserved entity references are &amp;,
&apos;, &gt;, &lt;, and &quot;.
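A well-formedness check falls out of any conforming parser; a sketch with
the standard library, where a rule violation raises ParseError:

import xml.etree.ElementTree as ET

for doc in ["<a><b>nested</b></a>", "<a><b>bad nesting</a></b>"]:
    try:
        ET.fromstring(doc)
        print("well-formed:", doc)
    except ET.ParseError as err:
        print("not well-formed:", doc, "--", err)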
DTD Graph
• Given the DTD information of the XML to be stored, we can
create a structure called the Data Type Definition Graph that
mirrors the structure of the DTD. Each node in the Data Type
Definition Graph represents an XML element as a rectangle, an XML
attribute as a semi-circle, and an operator as a circle. They are put
together in a hierarchical containment under a root element node,
with element nodes under a parent element node, separated by
occurrence indicators in circles.

• Facilities are available to link elements together with an Identifier
(ID) and Identifier Reference (IDREF). An element with IDREF
refers to an element with ID. Each ID must have a unique address.
Nodes can refer to each other by using ID and IDREF, such that
nodes with IDREF refer to nodes with ID.
Extended Entity Relationship Model DTD Graph
(Figure: an Extended Entity Relationship model with entities A-H and
relationships R1-R7, mapped to a DTD graph: elements nested under a root
element, with * occurrence indicators and ID/IDREF links between elements.)
An EER model for Customer Sales

(Figure: an EER model for Customer Sales with entities Invoice (invoice_no,
quantity, invoice_amount, invoice_date, shipment_date), Customer
(customer_no, customer_name, sex, postal_code, telephone, email), Item
(item_no, item_name, author, publisher, item_price), Monthly Sales (year,
month, quantity, total), plus Customer Address, Invoice Item, Item Sales
and Customer Sales, related through relationships R1-R11.)
A Sample DTD graph for Customer Sales
(Figure: the corresponding DTD graph, with Sales as root element (+);
Invoice, Customer, Item and Monthly Sales as child elements under *
occurrence indicators; Invoice Item, Customer Address, Item Sales and
Customer Sales nested below; and idref links from Invoice to Customer,
Invoice Item to Item, Item Sales to Item, and Customer Sales to Customer.)
The mapped DTD from DTD Graph
<!ELEMENT Sales (Invoice*, Customer*, Item*, Monthly_sales*)>
<!ATTLIST Sales
Status (New | Updated | History) #REQUIRED>
<!ELEMENT Invoice (Invoice_item*)>
<!ATTLIST Invoice
Invoice_no CDATA #REQUIRED
Quantity CDATA #REQUIRED
Invoice_amount CDATA #REQUIRED
Invoice_date CDATA #REQUIRED
Shipment_date CDATA #IMPLIED
Customer_idref IDREF #REQUIRED>
<!ELEMENT Customer (Customer_address*)>
<!ATTLIST Customer
Customer_id ID #REQUIRED
Customer_name CDATA #REQUIRED
Customer_no CDATA #REQUIRED
Sex CDATA #IMPLIED
Postal_code CDATA #IMPLIED
Telephone CDATA #IMPLIED
Email CDATA #IMPLIED>
<!ELEMENT Customer_address EMPTY>
<!ATTLIST Customer_address
Address_type (Home|Office) #REQUIRED
Address NMTOKENS #REQUIRED
City CDATA #IMPLIED
State CDATA #IMPLIED
Country CDATA #IMPLIED
Customer_idref IDREF #REQUIRED
Is_default (Y|N) "Y">
<!ELEMENT Invoice_Item EMPTY>
<!ATTLIST Invoice_Item
Quantity CDATA #REQUIRED
Unit_price CDATA #REQUIRED
Invoice_price CDATA #REQUIRED
Discount CDATA #REQUIRED
Item_idref IDREF #REQUIRED>
<!ELEMENT Item EMPTY>
<!ATTLIST Item
Item_id ID #REQUIRED
Item_name CDATA #REQUIRED
Author CDATA #IMPLIED
Publisher CDATA #IMPLIED
Item_price CDATA #REQUIRED>
<!ELEMENT Monthly_sales (Item_sales*, Customer_sales*)>
<!ATTLIST Monthly_sales
Year CDATA #REQUIRED
Month CDATA #REQUIRED
Quantity CDATA #REQUIRED
Total CDATA #REQUIRED>
<!ELEMENT Item_sales EMPTY>
<!ATTLIST Item_sales
Quantity CDATA #REQUIRED
Total CDATA #REQUIRED
Item_idref IDREF #REQUIRED>
<!ELEMENT Customer_sales EMPTY>
<!ATTLIST Customer_sales
Quantity CDATA #REQUIRED
Total CDATA #REQUIRED
Customer_idref IDREF #REQUIRED>
Review Question 1
What are the similarity and dissimilarity
between DTD and Well-formed Document?

Tutorial question 1
Map the following Extended Entity Relationship Model into
a DTD Graph and a Document Type Definition (DTD)

(Figure: an ER model for car rental: Department (Department_ID, Salary)
1:n "has" Trip (Trip_ID) 1:n "taken" Car_rental (Staff_ID, Car_model).)
Reading assignment
Chapter 3 Schema Translation in “Information
Systems Reengineering and Integration”
Second Edition, by Joseph Fong, published
by Springer, 2006, pp.142-154.

Extended Entity Relationship (EER) Model

• The ER model has been widely used but has some
shortcomings.
• It is difficult to represent cases where an entity
may have varying attributes dependent upon
some property.
• The ER model has been extended into the Extended
Entity Relationship model, which includes
more semantics such as generalization,
categorization and aggregation.
Cardinality: One-to-one relationship
A one-to-one relationship between set A and set B is defined as: For all a in A, there exists at most one b
in B such that a and b are related, and vice versa.

Example
A president leads a nation.

Relational Model:
Relation President (President_name, Race, *Nation_name)
Relation Nation (Nation_name, Nation_size)
Where underlined are primary keys and "*" prefixed are foreign keys

Extended Entity Relationship model

Cardinality: Many-to-one relationship
A many-to-one relationship from set A to set B is defined as: For all a in A, there exists at most one b in B
such that a and b are related, and for all b in B, there exists zero or more a in A such that a and b are related.

Example A director directs many movies.

Relational Model:
Relation Director (Director_name, Age)
Relation Movies (Movie_name, Sales_volume, *Director_name)

Extended entity relationship model:

Cardinality: Many-to-many relationship
A many-to-many relationship between set A and set B is defined as: For all a in A, there exists zero
or more b in B such that a and b are related, and vice versa.

Example
Many students take many courses such that a student can take many courses and a course can be taken by
many students.

Relational Model:
Relation Student (Student_id, Student_name)
Relation Course (Course_id, Course_name)
Relation take (*Student_id, *Course_id)

Extended entity relationship model:

Data Semantic: Is-a (Subtype) relationship
The relationship A isa B is defined as: A is a special kind of B.

Example
Father is Male.

Relational Model:
Relation Male (Name, Height)
Relation Father (*Name, Birth_date)

Extended entity relationship model

Data Semantic: Disjoint Generalization
-Generalization is to classify similar entities into a single entity. More than one is-a
relationship can form data abstraction (i.e. super-class and subclasses) among entities.
- A subclass entity is a subset of its super-class entity. There are two kinds of generalization.
- The first is disjoint generalization such that subclass entities are mutually exclusive.
- The second is overlap generalization such that subclass entities can overlap each other.

Example of Disjoint Generalization


A refugee and a non-refugee can both be a boat
person, but a refugee cannot be a non-refugee, and
vice versa.

Relational Model:
Relation Boat_person (Name, Birth_date,
Birth_place)
Relation Refugee (*Name, Open_center)
Relation Non-refugee (*Name, Detention_center)

Extended entity relationship model:

Data Semantic: Overlap Generalization
Example of Overlap Generalization
A computer programmer and a system analyst can both be a computer professional, and a computer
programmer can also be a system analyst, and vice versa.

Relational Model:
Relation Computer_professional (Employee_id, Salary)
Relation Computer_programmer (*Employee_id, Language_skill)
Relation System_analyst (*Employee_id, Application_system)

Extended entity relationship model:

Data Semantic: Categorization Relationship
In some cases the need arises for modeling a single superclass/subclass
relationship with more than one superclass, where the superclasses
represent different entity types. In this case, we call the subclass a
category.

Relational Model:
Relation Department (Borrower_card, Department_id)
Relation Doctor (Borrower_card, Doctor_name)
Relation Hospital (Borrower_card, Hospital_name)
Relation Borrower (*Borrower_card, Return_date, File_id)

Extended Entity Relationship Model

Data Semantic: Aggregation Relationship
Aggregation is a method to form a composite object from its components. It aggregates
attribute values of an entity to form a whole entity.

Example
The process of a student taking a course can form a composite entity (aggregation) that may be
graded by an instructor if the student completes the course.
Relational Model:
Relation Student (Student_no, Student_name)
Relation Course (Course_no, Course_name)
Relation Takes (*Student_no, *Course_no, *Instructor_name)
Relation Instructor (Instructor_name, Department)
Extended Entity Relationship Model

Data Semantic: Total Participation
An entity is in total participation with another entity provided that all data occurrences of the
entity participate in a relationship with the other entity.

Example
An employee must be hired by a department.

Relational Model:
Relation Department (Department_id, Department_name)
Relation Employee (Employee_id, Employee_name, *Department_id)

Extended entity relationship model:

Data Semantic: Partial Participation
An entity is in partial participation with another entity provided that the data occurrences of the
entity do not all participate in a relationship with the other entity.

Example
An employee may be hired by a department.
Relational Model:
Relation Department (Department_id, Department_name)
Relation Employee (Employee_no, Employee_name, &Department_id)
where '&' means that a null value is allowed

Extended entity relationship model:

Data Semantic: Weak Entity
The existence of a weak entity depends on its strong entity.

Example
A hotel room must concatenate hotel name for identification.
Relational Model:
Relation Hotel (Hotel_name, Ranking)
Relation Room (*Hotel_name, Room_no, Room_size)

Extended entity relationship model

Cardinality: N-ary Relationship
Multiple entities relate to each other in an n-ary relationship.

Example
Employees use a wide range of different skills on each project they are associated with.
Relational Model:
Relation Engineer (Employee_id, Employee_name)
Relation Skill (Skill_name, Years_experience)
Relation Project (Project_id, Start_date, End_date)
Relation Skill_used (*Employee_id, *Skill_name, *Project_id)

Extended entity relationship model:

Architecture of multiple databases integration
(Figure: architecture of multiple databases integration)
Step 1. Reverse Engineering (for Schema Translation): each relational
schema 1..n is translated into an EER model.
Step 2. Schema Integration: EER models 1..n are merged into an integrated
schema.
Step 3. Data Integration: relational databases 1..n are integrated into a
global database, using a look-up table.
Reverse engineering
relational schema into EER model

Step 1 Defining each relation, key and field:


1) The relations are preprocessed by making any
necessary candidate key substitutions as follows:
• Primary relation: describing entities.
– Primary relation - Type 1 (PR1). This is a relation whose
primary key does not contain a key of another relation.
– Primary relation - Type 2 (PR2). This is a relation whose
primary key does contain a key of another relation.

• Secondary relation: a relation whose primary key is
fully or partially formed by concatenation of primary
keys of other relations.
– Secondary relation - Type 1 (SR1). If the key of the
secondary relation is formed by concatenation of primary
keys of primary relations, it is of Type 1 or SR1.
– Secondary relation - Type 2 (SR2). Secondary relations that
are not of Type 1.
• Key attribute - Primary (KAP). This is an attribute
in the primary key of a relation, and is also a
foreign key of another relation.
• Key attribute - General (KAG). These are all the
other primary key attributes in a secondary
relation that are not of the KAP type.
• Foreign key attribute (FKA). This is a non-primary-
key attribute of a primary relation that is a foreign
key.
• Nonkey attribute (NKA). The rest of the non-
primary-key attributes. (A small sketch of this
classification follows Step 2 below.)

Step 2 Map each PR1 into entity


• For each Type 1 primary relation (PR1), define a
corresponding entity type and identify it by the
primary key. Its nonkey attributes map to the
attributes of the entity types with the corresponding
domains.

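A sketch of the Step 1 classification in Python, using the university
enrollment relations that appear later in this lecture (the dict-of-key-sets
encoding is a hypothetical simplification of a real schema catalog):

def classify(name, keys):
    pk = keys[name]
    others = {n: k for n, k in keys.items() if n != name}
    # keys of other relations wholly contained in this relation's key
    contained = [k for k in others.values() if k <= pk]
    covered = set().union(*contained) if contained else set()
    if not contained:
        return "PR1"                     # key borrows nothing
    if len(contained) == 1:
        return "PR2"                     # contains one other relation's key
    return "SR1" if covered == pk else "SR2"

keys = {
    "DEPT": {"Dept#"}, "COUR": {"Course#"}, "STUD": {"Student#"},
    "PREP": {"Prer#"}, "INST": {"Dept#", "Inst_name"},
    "SECT": {"Dept#", "Course#", "Inst_name", "Section#"},
    "GRADE": {"Dept#", "Course#", "Inst_name", "Student#", "Section#"},
}
for r in keys:
    print(r, classify(r, keys))   # DEPT PR1, INST PR2, SECT SR2, GRADE SR1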
Step 3 Map each PR2 into a subclass entity or
weak entity
Case 1: PR1 Relation A (A1, A2) with PR2 Relation B (*A1, B1, B2):
B becomes a weak entity of strong entity A through a 1:n relationship R.

Case 2: PR1 Relation A (A1, A2) with PR2 Relation B (*A1, B1):
B becomes a subclass entity of superclass A through an ISA relationship.
Step 4 Map SR1 into binary/n-ary relationship

Relational schema: PR1 Relation A (A1, A2), PR1 Relation B (B1, B2),
SR1 Relation AB (*A1, *B1).
EER model: entities A and B in an m:n relationship R.
Step 5 Map SR2 into binary/n-ary relationship

Relational schema: PR1 Relation A (A1, A2), PR1 Relation B (B1, B2),
SR2 Relation C (*A1, *B1, C1).
EER model: C becomes an entity related to A through a 1:n relationship R1
and to B through a 1:m relationship R2.
Step 6 Map each FKA into relationship

Relational schema: PR1 Relation A (A1, A2), PR1 Relation B (B1, B2, *A1).
EER model: entities A and B in a 1:n relationship R, with A on the "1" side.
Step 7 Map inclusion dependency into semantic
If IDs have been derived between two relations, relation A with a as
primary key and b' as foreign key, relation B with b as primary key
and a' as foreign key, then:
Case 1. Given ID: a' ⊆ a, then entity A is in a 1:n relationship with
entity B.
Case 2. Given ID: a' ⊆ a and b' ⊆ b, then entity A is in a 1:1 relationship
with entity B.
Case 3. Given ID: a' ⊆ a and b' ⊆ b, where a'b' is a composite key, then
entity A is in an m:n relationship with entity B.

Step 8 Draw EER model.
Draw the derived EER model as a result of the above steps.
A university enrollment system:
Relations:
Department (Dept#, Dept_name)
Instructor (*Dept#, Inst_name, Inst_addr)
Course (Course#, Course_location)
Prerequisite (Prer#, Prer_title, *Course#)
Student (Student#, Student_name)
Section (*Dept#, *Course#, *Inst_name, Section#)
Grade (*Dept#, *Inst_name, *Course#, *Student#,
*Section#, Grade)

Relations and attributes classification table
Relation  Rel   Primary Key                    KAP                        KAG        FKA      NKA
Name      Type
DEPT      PR1   Dept#                                                                         Dept_name
INST      PR2   Dept#, Inst_name               Dept#                      Inst_name           Inst_addr
COUR      PR1   Course#                                                                       Course_location
STUD      PR1   Student#                                                                      Stud_name
PREP      PR1   Prer#                                                                Course#  Prer_title
SECT      SR2   Course#, Dept#,                Course#, Dept#,            Section#
                Section#, Inst_name            Inst_name
GRADE     SR1   Inst_name, Course#,            Inst_name, Course#,                            Grade
                Student#, Dept#, Section#      Student#, Dept#, Section#
Step 2. Map each PR1 into entity

Entities: Department (Dept#, Dept_name), Prerequisite (Prer#, Prer_title),
Student (Student#, Student_name), Course (Course#, Course_Location).
Step 3. Map each PR2 into weak entity.

Department (Dept#, Dept_name) 1:n "hire" Instructor (Dept#, Inst_name,
Inst_addr), where Instructor is a weak entity of Department.
Step 4. Map SR1 into binary/n-ary relationship.

Student (Student#, Student_name) m:n "grade" (relationship attribute:
Grade) Section (Dept#, Section#, Inst_name, Course#).
Step 5. Map SR2 into binary/n-ary
relationship

Instructor (Dept#, Inst_name, Inst_addr) 1:n "teach" Section, and
Course (Course#, Course_Location) 1:n "has" Section, where Section
carries Dept#, Inst_Name, Course#, and Section#.
Step 6. Map each FKA into relationship

Course (Course#, Course_Location) 1:1 "pre-course" Prerequisite
(Prer#, Prer_title).
Step 7. Map each inclusion dependency
into semantics (binary/n-ary relationship)
Given derived inclusion dependency              Derived semantics

Instructor.Dept# ⊆ Department.Dept#             n:1 relationship between entities
                                                Instructor and Department

Section.Dept# ⊆ Department.Dept#                1:n relationships between entities
Section.Inst_name ⊆ Instructor.Inst_name        Instructor and Section,
Section.Course# ⊆ Course.Course#                and between Course and Section

Grade.Dept# ⊆ Section.Dept#                     m:n relationship between
Grade.Inst_name ⊆ Section.Inst_name             relationship Section and
Grade.Course# ⊆ Section.Course#                 entity Student
Grade.Student# ⊆ Student.Student#

Prerequisite.Course# ⊆ Course.Course#           1:1 relationship between Course
Course.Prer# ⊆ Prerequisite.Prer#               and Prerequisite
Step 8. Draw EER model.
(Figure: the derived EER model: Department 1:n "hire" Instructor,
Course 1:1 "pre-course" Prerequisite, Instructor 1:n "teach" Section,
Course 1:n "has" Section, and Student m:n "grade" Section with
relationship attribute Grade.)
Reading assignment
Chapter 3 Schema Translation in “Information
Systems Reengineering and Integration” by
Joseph Fong, published by Springer Verlag,
2006, pp. 115-121.

Review question 2
What are the major differences between
Generalization and Categorization in terms
of data volume (data occurrences) in their
related superclass entity/entities and
subclass entity/entities?

Is there any special case such that these two


data semantics can be overlapped?

Tutorial Question 2
In a reverse engineering approach, translate the following relational
schema to an Entity-relationship model.

Order (Order_code, Order_type, Our_reference, Order_date,


Approved_date, *Head, *Supplier_code)
Supplier (Supplier_code, Supplier_name)
Product (Product_code, Product_description)
Department (Department_code, Department_name)
Head (Head, *Department_code, Title)
Order_Product (*Order_code, *Product_code, Qty, Others, Amount)
Note (*Order_code, Sequence#, Note)

where underlined are primary keys and prefixed with ‘*’ are foreign keys.

Architecture of multiple databases integration
(Figure: architecture of multiple databases integration)
Step 1. Reverse Engineering (for Schema Translation): each relational
schema 1..n is translated into an EER model.
Step 2. Schema Integration: EER models 1..n are merged into an integrated
schema.
Step 3. Data Integration: relational databases 1..n are integrated into a
global database, using a look-up table.
Data Integration Concepts
• Data semantics define the relationships between data for
user’s data requirements.
• Data semantics are presented in database conceptual schema
such as EER model and DTD Graph.
• Only relevant data can be integrated for an application.
• Data relevance depends on user’s data requirements for an
application.
• Data consistency is the standard for data domain values and
formats. Inconsistent data must be transformed before data
integration.
• User supervision means user input of the user's data
requirements, which drives database design and schema
integration.
Schema integration of EER models

Begin
  For each existing database do
    Begin
      If its conceptual schema does not exist
      then reconstruct its EER model by reverse engineering;
    End;
  For each pair of existing databases' EER models, schema A and schema B, do
    Begin
      Resolve semantic conflicts between EER model A and EER model B;  /* step 1 */
      Merge classes between EER model A and EER model B;               /* step 2 */
      Merge relationships between EER model A and EER model B;         /* step 3 */
    End;
End
Step 1 Resolve conflicts among EER model

Resolve conflicts on synonyms and homonyms (with user supervision):
Schema A: Entity_a (Attribute_x); Schema B: Entity_b (Attribute_x)
==> Schema A': Entity_a (Attribute_x1); Schema B': Entity_a (Attribute_x2)
(note: Entity_a and Entity_b are synonyms; the two occurrences of
Attribute_x are homonyms)
Example

Schema A: Customer (name, date); Schema B: Loan Borrower (name, date)
==> Schema A': Customer (name, open-acct-date);
    Schema B': Customer (name, loan-begin-date)
Resolve conflicts on data types

When an attribute Ax of entity Ca in schema A appears as an entity Cx in
schema B, the transformed schema A' relates Ca to Cx through a relationship
R, with n:1, 1:1 or m:n cardinality as appropriate.
Example
(Example: the attribute "type" of Customer in schema A and the entity
Loan Contract in schema B are transformed so that Customer is related to
Loan Contract through a "book" relationship, with n:1, 1:1 or n:m
cardinality.)
Resolve conflicts on key

When the same entity has different keys in two schemas, user supervision
selects one key for the transformed schema X; the remaining key attribute
is kept as an ordinary (candidate key) attribute.
Example

(Example: Customer with key "name" in schema A and Customer with "name"
and "customer#" in schema B merge into Customer with primary key customer#
and attribute name.)
Resolve conflicts on cardinality

Schema A: Entity_x 1:1 R Entity_y; Schema B: Entity_x 1:n R Entity_y
==> Transformed Schema A': Entity_x 1:n R Entity_y
(the more general cardinality is kept).
Resolve conflicts on weak entity
When Entity_y is a weak entity of Entity_x in one schema but a regular
entity in the other, the transformed schema keeps the weak-entity form,
with Entity_x 1:n R Entity_y.
Resolve conflicts on subtype entity

(Figure: the 1:1 relationships between Entity_x and Entity_y in schemas A
and B are reconciled into a single 1:1 relationship R in the transformed
schema X.)
Step 2 Merge entities

Merge classes by union:
Schema A: Entity_x (Attr_a); Schema B: Entity_x (Attr_b)
==> Schema X: Entity_x (Attr_a, Attr_b)
Example

Schema A: Customer (customer#, address, phone);
Schema B: Customer (customer#, age)
==> Schema X: Customer (customer#, address, phone, age)
Merge EER models by Generalization
Schema A: Entity_x (Attr_k); Schema B: Entity_y (Attr_k)
==> (user supervision) Schema X: superclass Entity_z (Attr_k) with
subclasses Entity_x and Entity_y, using disjoint (d) generalization when
the subclasses are mutually exclusive, or overlap (o) generalization when
they can overlap.
Example
(Examples: Local Customer (name, local_addr) and Overseas Customer
(name, overseas_addr) merge under a Customer superclass by disjoint
generalization; Commercial Loan Borrower (name, loan_amt, prime_rate) and
Mortgage Loan Borrower (name, loan_amt, mortgage_rate) merge under a Loan
Borrower (name, loan_amt) superclass by overlap generalization.)
Merge EER models by Subtype
Relationship

Schema A: Entity_x; Schema B: Entity_y
==> (user supervision) Schema X: Entity_x R Entity_y, where R is a
subtype (isa) relationship.
Example

(Example: Interest Rate (K) in schema A and Fixed Rate (K) in schema B
merge into schema X: Fixed Rate isa Interest Rate.)
Merge entities by Aggregation
Schema A: Entity_x; Schema B: Entity_y 1:n R Entity_z
==> Schema X: Entity_y 1:n R1 Entity_z, with the aggregate of that
relationship related to Entity_x through R2.
Example

(Example: Loan Security in schema A and Customer 1:n "book" Loan Contract
in schema B merge into schema X, where the aggregated Customer/Loan
Contract relationship is 1:n "secured by" Loan Security.)
Merge entities by categorization

Schema A: Entity_X, Entity_Y; Schema B: Entity_Z
==> (user supervision) Schema X: Entity_Z becomes a category (C) whose
superclasses are Entity_X and Entity_Y.
Example

(Example: Mortgage Loan and Commercial Loan in schema A and Loan Contract
in schema B merge so that Loan Contract becomes a category of the
superclasses Mortgage Loan and Commercial Loan.)
Merge entities by Implied Binary
Relationship

If Entity_a in schema A carries attribute Attr_X, which is the key of
Entity_b in schema B, the entities merge into Entity_a n:1 R Entity_b.
If Entity_a carries both Attr_X and Attr_Y and each identifies Entity_b,
the relationship becomes 1:1.
Example
(Example: Loan Contract (loan#, customer#) and Customer (customer#) merge
into Loan Contract n:1 "book" Customer; when loan# and customer# identify
each other uniquely, the "book" relationship is 1:1.)
Step 3 Merge relationships
Merge relationships by subtype relationship

Schema A: Entity_X 1:1 R Entity_Y; Schema B: Entity_X 1:1 R Entity_Y
==> Schema X: Entity_X 1:1 R Entity_Z, where Entity_Z isa Entity_Y.
Merge relationships by overlap generalization

Schema A: Entity_X 1:n R1 Entity_Y; Schema B: Entity_X 1:n R2 Entity_Y
==> Schema X: Entity_X 1:n R1 Entity_Z1 and Entity_X 1:n R2 Entity_Z2,
where Entity_Z1 and Entity_Z2 are an overlap (o) generalization of
Entity_Y.
Example
Schema A: Bank 1 -- home loan -- n Customer   +   Schema B: Bank 1 -- auto loan -- n Customer
===>  Schema X: X1 Bank with home loan (1:n) to X3 Home loan borrower and
      auto loan (1:n) to X4 Auto loan borrower, which are generalized with
      overlap (o = overlap generalisation) into X2 Customer
2023/2/5 31
Absorbing Lower degree
Relationship into a Higher degree
Relationship
Schema A: ternary m:n relationship R1 among EntityX, EntityY and EntityZ
   +  Schema B: EntityX 1 -- R2 -- n EntityY
===>  Schema X: the binary relationship R2 is absorbed into the ternary m:n
      relationship R1 among X1 EntityX, X2 EntityY and X3 EntityZ

2023/2/5 32
Example

2023/2/5 33
Case Study: In a bank, there are existing databases with different
schemas: one for the customers, another for the mortgage loan contracts
and a third for the index interest rate. However, there is a need to
integrate them together for an international banking loan system.
Entities and attributes (EER diagram):
  Mortgage loan contract (loan contract#, begin_date, mature_date, loan-status)
  Customer (ID#, customer_name, date)
  Mortgage loan drawdown (loan contract#, drawdown_date, drawdown_amt)
  Mortgage loan interest type (loan contract#, Interest_effective_date, fixed_rate, accrued_interest)
  Mortgage loan repayment (loan contract#, repayment_date, repayment)
  Mortgage loan balance (Loan_contract#, loan_bal_date, balance_amt)
  Loan account history (past loan#, account#, past_loan_status)
  Mortgage loan index interest (Interest_effective_date, index_rate)
Relationships: the Mortgage loan contract is linked through draw (1:1),
accrue (1:(0,n)), repaid (1:n) and balanced (1:1) to the drawdown, interest
type, repayment and balance entities; the loan account history is
accumulated (1:n); and the Customer opened the account.

2023/2/5 34
Step 2 Merge entities by Implied binary relationship and Generalization
Since data ID# appears in entity Mortgage Loan Contract and entity
Customer, they are in one-to-many cardinality with foreign key on the
“many” side.

Mortgage loan contract (loan contract#, begin_date, mature_date)
   m -- book -- 1   Customer (ID#, customer_name, date)

Since Fixed and Index Interest Rate both have the same key,
Interest_effective_date, they can be generalized disjointly as either a
fixed or an index rate.
Loan interest type (Interest_effective_date, Interest_type)
   d = disjoint generalisation into:
   fixed interest rate (Interest_effective_date, fixed_rate, accrued_interest)
   index interest type (Interest_effective_date, Index_rate, Index_name, accrued_interest)

2023/2/5 35
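As an illustration of how such a disjoint generalization can be carried into relational tables, here is a minimal sketch in Python with sqlite3 (the table and column names follow the case study; the CHECK constraint and the data types are assumptions, not part of the original design):

import sqlite3

# Sketch: the superclass table holds the shared key and a type discriminator;
# each subclass table holds its own attributes and references the superclass key.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE loan_interest_type (
    interest_effective_date TEXT PRIMARY KEY,
    interest_type TEXT CHECK (interest_type IN ('fixed', 'index'))  -- d = disjoint
);
CREATE TABLE fixed_interest_rate (
    interest_effective_date TEXT PRIMARY KEY
        REFERENCES loan_interest_type(interest_effective_date),
    fixed_rate REAL,
    accrued_interest REAL
);
CREATE TABLE index_interest_type (
    interest_effective_date TEXT PRIMARY KEY
        REFERENCES loan_interest_type(interest_effective_date),
    index_rate REAL,
    index_name TEXT
);
""")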
Thus, we can derive cardinality from the implied relationship between
these entities, and integrate the two schemas into one EER model.

Integrated EER model (Schema X):
  Mortgage loan contract (loan contract#, begin_date, mature_date)
     m -- book -- 1  Customer (ID#, customer_name, date)
  Mortgage loan contract is further linked through draw (1:1), accrue (1:(0,n)),
  repaid (1:n), balanced (1:1) and accumulate/opened (1:n) to:
     Mortgage loan drawdown (loan contract#, drawdown_date, drawdown_amt)
     Mortgage loan interest type (Loan_contract#, Interest_effective_date, Interest_type)
        with d = disjoint generalisation into
        fixed interest rate (Loan_contract#, Interest_effective_date, fixed_rate, accrued_interest)
        index interest type (Loan_contract#, Interest_effective_date, Index_rate, accrued_interest)
     Mortgage loan repayment (loan contract#, repayment_date, repayment)
     Mortgage loan balance (loan contract#, loan_bal_date, balance_amt)
     Loan account history (past loan#, account#, past_loan_status)

2023/2/5 36
Reading assignment
“Information Systems Reengineering and
Integration” by Joseph Fong, published by
Springer Verlag, 2006, pp. 282-310.

2023/2/5 37
Review question 3
1. Can two Relational Schemas be integrated into one
   Relational Schema? Explain your answer.
2. Which steps need user supervision for integrating
   two Extended Entity Relationship Models into one
   Extended Entity Relationship Model? Explain
   your answer.

2023/2/5 38
Tutorial question 3
Provide an integrated schema for the following two views, which are merged to create a bibliographic database. During
identification of correspondences between the two views, the users discover the following:
RESEARCHER and AUTHOR are synonyms,
CONTRIBUTED_BY and WRITTEN_IN are synonyms,
ARTICLE belongs to a SUBJECT,
ARTICLE and BOOK can be generalized as PUBLICATION.

Hint: given that two subclass entities have the same relationship(s), the two subclass entities can be generalized into a
superclass entity, and the subclass relationship(s) can also be generalized into a superclass relationship.
View 1:
  ARTICLE (Title, Size)  n -- Published_in -- 1  JOURNAL (Volume, Number)
  ARTICLE -- Contributed_by -- RESEARCHER (Full_Name)

View 2:
  BOOK (Title, Publisher)  n -- Belongs_to -- 1  SUBJECT (Classification_id, Name)
  BOOK  n -- Written_in -- m  AUTHOR (Full_Name)
2023/2/5 39
Methodology for Data warehousing with OLAP

2008/2/19 1
Data conversion: Customized Program Approach

A common approach to data conversion is to
develop customized programs to transfer data
from one environment to another. However, the
customized program approach is extremely
expensive because it requires a different program
for each of the M source files and N target files
(on the order of M × N programs). Furthermore,
these programs are used only once. As a result,
depending entirely on customized programs for
data conversion is unmanageable, too costly and
time consuming.
2008/2/19 2
Interpretive Transformer approach of data
conversion

Definitions of source, target, and mapping
          |
          v
source --> Interpretive transformer --> target

2008/2/19 3
Interpretive transformer
For instance, the following Cobol structure
Level 0 PERSON

1 NAME AGE CARS

2 LIC#1 MAKE ACCIDENT

3 LIC#2 NAME

can be expressed using these SDDL statements:

Data Structure          PERSON <NAME, AGE>
Instance N-tuples       CARS <LIC#1, MAKE>
                        ACCIDENTS <LIC#2, NAME>
Relationship N-tuples   PERSON-CAR <NAME, LIC#1>
                        CAR-ACCIDENT <LIC#1, LIC#2>

2008/2/19 4
To translate the above three-level structure into the following two-level data structure:

Level 0 PERSON

1 NAME AGE CARS

2 LIC#1 MAKE LIC#2 NAME

the TDL statements are:

FORM NAME FROM NAME
FORM LIC#1 FROM LIC#1
:
FORM PERSON IF PERSON
FORM CARS IF CAR AND ACCIDENT
2008/2/19 5
Translator Generator Approach of data
conversion
Definitions of source, target, and mapping
          |
          v
Translator generator
          |
          v
source --> Specialized program --> target

2008/2/19 6


DEFINE and CONVERT compile phases:

DEFINE compile phase: the DEFINE compiler reads the source definitions
(DEF S1, DEF S2, ...) with a PL/1 reader and produces one reader per source
(Reader(S1), Reader(S2), ...) together with a CONVERT catalog.

CONVERT compile phase: the CONVERT compiler reads the statements of the
CONVERT program (Statement 1, Statement 2, ...) and produces a PL/1
restructurer procedure (COP 1, COP 2, ...) and an execution schedule from
the CONVERT catalogue.
2008/2/19 7
As an example, consider the following hierarchical database:

DNO MGR BUDGET

ENO JOB PJNO LEADER

ITEMNO DESC

2008/2/19 8
Its DEFINE statements can be described as follows, where for each DEFINE
statement, code is generated to allocate a new subtree in the internal buffer:

GROUP DEPT:
  OCCURS FROM 1 TIMES;
  FOLLOWED BY EOF;
  PRECEDED BY HEX ‘01’;
  :
  END EMP;
  GROUP PROJ:
    OCCURS FROM 0 TIMES;
    PRECEDED BY HEX ‘03’;
    :
  END PROJ;
END DEPT;

2008/2/19 9
For each user-written CONVERT statement, we can produce a customized
program. Take the DEPT FORM from the above DEFINE statement:

T1 = SELECT (FROM DEPT WHERE BUDGET GT ‘100’);

will produce the following program:

/* PROCESS LOOP FOR T1 */
DO WHILE (not end of file);
  CALL GET (DEPT);
  IF BUDGET > ‘100’
    THEN CALL BUFFER_SWAP (T1, DEPT);
END
2008/2/19 10
Data conversion: Logical Level Translation
Approach

Source relational schema --(selection)--> Target relational schema
          |                                        |
          v                                        v
Source relational database --> Unload process --> Sequential files
  --> Conversion --> Converted sequential files --> Load process
  --> Target relational database

System flow diagram for data conversion from a source relational
database to a target relational database

2008/2/19 11
Case study of converting a relational database from DB2 to Oracle

Business requirements
A company has two regions A and B.
•Each region forms its own departments.
•Each department approves many trips in a year.
•Each staff makes many trips in a year.
•In each trip, a staff needs to hire cars for
transportation.
•Each hired car can carry many staff for each trip.
•A staff can be either a manager or an engineer.
2008/2/19 12
Data Requirements

Relation Department (*Department_id, Salary)
Relation Region_A (Department_id, Classification)
Relation Region_B (Department_id, Classification)
Relation Trip (Trip_id, *Car_model, *Staff_id, *Department_id)
Relation People (Staff_id, Name, DOB)
Relation Car (Car_model, Size, Description)
Relation Assignment (*Car_model, *Staff_id)
Relation Engineer (*Staff_id, Title)
Relation Manager (*Staff_id, Title)

ID: Department.Department_id ⊆ (Region_A.Department_id ∪ Region_B.Department_id)
ID: Trip.Department_id ⊆ Department.Department_id
ID: Trip.Car_model ⊆ Assignment.Car_model
ID: Trip.Staff_id ⊆ Assignment.Staff_id
ID: Assignment.Car_model ⊆ Car.Car_model
ID: Assignment.Staff_id ⊆ People.Staff_id
ID: Engineer.Staff_id ⊆ People.Staff_id
ID: Manager.Staff_id ⊆ People.Staff_id

where ID = inclusion dependency, underlined attributes are primary keys, and attributes prefixed with “*” are foreign keys.
2008/2/19 13
Extended Entity Relationship Model for database Trip
Region_A (Department_id, Classification) and Region_B (Department_id,
Classification) are subclasses of Department (Department_id, Salary)
Department 1 -- R1 -- m Trip (Trip_id)
Trip m -- R2 -- 1 (the Assignment relationship, treated as an aggregate)
People (Staff_id, Name, DOB) m -- Assignment -- n Car (Car_model, Size, Description)
Manager (Staff_id, Title) and Engineer (Staff_id, Title) are subclasses of People
2008/2/19 14
Schema for target relational database Trip
Create Table Car
(CAR_MODEL character (10),
SIZE character (10),
DESCRIPT character (20),
STAFF_ID character (5),
primary key (CAR_MODEL))

Create Table Depart
(DEP_ID character (5),
SALARY numeric (8),
primary key (DEP_ID))

Create Table People
(STAFF_ID character (4),
NAME character (20),
DOB datetime,
primary key (STAFF_ID))

Create Table Reg_A
(DEP_ID character (5),
DESCRIP character (20),
primary key (DEP_ID))

Create Table Reg_B
(DEP_ID character (5),
DESCRIPT character (20),
primary key (DEP_ID))

2008/2/19 15
Schema for target relational database Trip (continued)
Create Table trip
(TRIP_ID character (5),
primary key (TRIP_ID))

Create Table Engineer
(TITLE character (20),
STAFF_ID character (4),
Foreign Key (STAFF_ID) REFERENCES People(STAFF_ID),
Primary key (STAFF_ID))

Create Table Manager
(TITLE character (20),
STAFF_ID character (4),
Foreign Key (STAFF_ID) REFERENCES People(STAFF_ID),
Primary key (STAFF_ID))

Create Table Assign
(CAR_MODEL character (10),
Foreign Key (CAR_MODEL) REFERENCES Car(CAR_MODEL),
STAFF_ID character (4),
Foreign Key (STAFF_ID) REFERENCES People(STAFF_ID),
primary key (CAR_MODEL,STAFF_ID))

ALTER TABLE trip ADD CAR_MODEL character (10) null
ALTER TABLE trip ADD STAFF_ID character (4) null
ALTER TABLE trip ADD Foreign Key (CAR_MODEL,STAFF_ID) REFERENCES Assign(CAR_MODEL,STAFF_ID)

2008/2/19 16
Data insertion for target relational database Trip
INSERT INTO trip_ora.CAR (CAR_MODEL,CAR_SIZE,DESCRIPT,STAFF_ID) VALUES ('DA-02 ', '165 ', 'Long car ', 'A001 ')
INSERT INTO trip_ora.CAR (CAR_MODEL,CAR_SIZE,DESCRIPT,STAFF_ID) VALUES ('MZ-18 ', '120 ', 'Small sportics ', 'B004 ')
INSERT INTO trip_ora.CAR (CAR_MODEL,CAR_SIZE,DESCRIPT,STAFF_ID) VALUES ('R-023 ', '150 ', 'Long car ', 'A002 ')
INSERT INTO trip_ora.CAR (CAR_MODEL,CAR_SIZE,DESCRIPT,STAFF_ID) VALUES ('SA-38 ', '120 ', 'New dark blue ', 'D001 ')
INSERT INTO trip_ora.CAR (CAR_MODEL,CAR_SIZE,DESCRIPT,STAFF_ID) VALUES ('WZ-01 ', '1445 ', 'Middle Sportics ', 'B004 ')
INSERT INTO trip_ora.DEPART (DEP_ID,SALARY) VALUES ('AA001', 35670)
INSERT INTO trip_ora.DEPART (DEP_ID,SALARY) VALUES ('AB001', 30010)
INSERT INTO trip_ora.DEPART (DEP_ID,SALARY) VALUES ('BA001', 22500)
INSERT INTO trip_ora.DEPART (DEP_ID,SALARY) VALUES ('BB001', 21500)
INSERT INTO trip_ora.PEOPLE (STAFF_ID,NAME,DOB) VALUES ('A001', 'Alexender ', '7-1-1962')
INSERT INTO trip_ora.PEOPLE (STAFF_ID,NAME,DOB) VALUES ('A002', 'April ', '5-24-1975')
INSERT INTO trip_ora.PEOPLE (STAFF_ID,NAME,DOB) VALUES ('B001', 'Bobby ', '12-5-1987')
INSERT INTO trip_ora.PEOPLE (STAFF_ID,NAME,DOB) VALUES ('B002', 'Bladder ', '1-3-1980')
INSERT INTO trip_ora.PEOPLE (STAFF_ID,NAME,DOB) VALUES ('B003', 'Brent ', '12-15-1979')
INSERT INTO trip_ora.PEOPLE (STAFF_ID,NAME,DOB) VALUES ('B004', 'Brelendar ', '8-18-1963')
INSERT INTO trip_ora.PEOPLE (STAFF_ID,NAME,DOB) VALUES ('C001', 'Calvin ', '4-3-1977')
INSERT INTO trip_ora.PEOPLE (STAFF_ID,NAME,DOB) VALUES ('C002', 'Cheven ', '2-2-1974')
INSERT INTO trip_ora.PEOPLE (STAFF_ID,NAME,DOB) VALUES ('C003', 'Clevarance ', '12-6-1987')
INSERT INTO trip_ora.PEOPLE (STAFF_ID,NAME,DOB) VALUES ('D001', 'Dave ', '8-17-1964')
INSERT INTO trip_ora.PEOPLE (STAFF_ID,NAME,DOB) VALUES ('D002', 'Davis ', '3-19-1988')
INSERT INTO trip_ora.PEOPLE (STAFF_ID,NAME,DOB) VALUES ('D003', 'Denny ', '8-5-1985')
INSERT INTO trip_ora.PEOPLE (STAFF_ID,NAME,DOB) VALUES ('D004', 'Denny ', '2-21-1998')
INSERT INTO trip_ora.REG_A (DEP_ID,DESCRIP) VALUES ('AA001', 'Class A Manager ')
INSERT INTO trip_ora.REG_A (DEP_ID,DESCRIP) VALUES ('AB001', 'Class A Manager ')
INSERT INTO trip_ora.REG_A (DEP_ID,DESCRIP) VALUES ('AC001', 'Class A Manager ')
INSERT INTO trip_ora.REG_A (DEP_ID,DESCRIP) VALUES ('AD001', 'Class A Assistant ')
INSERT INTO trip_ora.REG_A (DEP_ID,DESCRIP) VALUES ('AE001', 'Class A Assistant ')
INSERT INTO trip_ora.REG_A (DEP_ID,DESCRIP) VALUES ('AF001', 'Class A Assistant ')
INSERT INTO trip_ora.REG_A (DEP_ID,DESCRIP) VALUES ('AG001', 'Class A Clark ')
INSERT INTO trip_ora.REG_A (DEP_ID,DESCRIP) VALUES ('AH001', 'Class A Assistant ')
INSERT INTO trip_ora.REG_B (DEP_ID,DESCRIPT) VALUES ('BA001', 'Class B Manager ')
INSERT INTO trip_ora.REG_B (DEP_ID,DESCRIPT) VALUES ('BB001', 'Class B Manager ')
INSERT INTO trip_ora.REG_B (DEP_ID,DESCRIPT) VALUES ('BC001', 'Class B Manager ')
2008/2/19 17
INSERT INTO trip_ora.REG_B (DEP_ID,DESCRIPT) VALUES ('BD001', 'Class B Assistant ')
INSERT INTO trip_ora.REG_B (DEP_ID,DESCRIPT) VALUES ('BE001', 'Class B Assistant ')
Data insertion for target relational database Trip (continue)
INSERT INTO trip_ora.REG_B (DEP_ID,DESCRIPT) VALUES ('BF001', 'Class B Assistant ')
INSERT INTO trip_ora.REG_B (DEP_ID,DESCRIPT) VALUES ('BG001', 'Class B Clark ')
INSERT INTO trip_ora.REG_B (DEP_ID,DESCRIPT) VALUES ('BH001', 'Class B Assistant ')

INSERT INTO trip_ora.TRIP (TRIP_ID) VALUES ('T0001')


INSERT INTO trip_ora.TRIP (TRIP_ID) VALUES ('T0002')
INSERT INTO trip_ora.TRIP (TRIP_ID) VALUES ('T0003')
INSERT INTO trip_ora.TRIP (TRIP_ID) VALUES ('T0004')
INSERT INTO trip_ora.ENGINEER (TITLE,STAFF_ID) VALUES ('Electronic Engineer ', 'A001')
INSERT INTO trip_ora.ENGINEER (TITLE,STAFF_ID) VALUES ('Senior Engineer ', 'B003')
INSERT INTO trip_ora.ENGINEER (TITLE,STAFF_ID) VALUES ('Electronic Engineer ', 'B004')
INSERT INTO trip_ora.ENGINEER (TITLE,STAFF_ID) VALUES ('Junior Engineer ', 'C002')
INSERT INTO trip_ora.ENGINEER (TITLE,STAFF_ID) VALUES ('System Engineer ', 'D001')
INSERT INTO trip_ora.ENGINEER (TITLE,STAFF_ID) VALUES ('Material Engineer ', 'D002')
INSERT INTO trip_ora.ENGINEER (TITLE,STAFF_ID) VALUES ('Parts Engineer ', 'D003')
INSERT INTO trip_ora.MANAGER (TITLE,STAFF_ID) VALUES ('Sales Manager ', 'A002')
INSERT INTO trip_ora.MANAGER (TITLE,STAFF_ID) VALUES ('Marketing Manager ', 'B001')
INSERT INTO trip_ora.MANAGER (TITLE,STAFF_ID) VALUES ('Sales Manager ', 'B002')
INSERT INTO trip_ora.MANAGER (TITLE,STAFF_ID) VALUES ('General Manager ', 'C001')
INSERT INTO trip_ora.MANAGER (TITLE,STAFF_ID) VALUES ('Marketing Manager ', 'C003')
INSERT INTO trip_ora.ASSIGN (CAR_MODEL,STAFF_ID) VALUES ('MZ-18 ', 'A002')
INSERT INTO trip_ora.ASSIGN (CAR_MODEL,STAFF_ID) VALUES ('MZ-18 ', 'B001')
INSERT INTO trip_ora.ASSIGN (CAR_MODEL,STAFF_ID) VALUES ('MZ-18 ', 'D003')
INSERT INTO trip_ora.ASSIGN (CAR_MODEL,STAFF_ID) VALUES ('R-023 ', 'B004')
INSERT INTO trip_ora.ASSIGN (CAR_MODEL,STAFF_ID) VALUES ('R-023 ', 'C001')
INSERT INTO trip_ora.ASSIGN (CAR_MODEL,STAFF_ID) VALUES ('R-023 ', 'D001')
INSERT INTO trip_ora.ASSIGN (CAR_MODEL,STAFF_ID) VALUES ('SA-38 ', 'A001')
INSERT INTO trip_ora.ASSIGN (CAR_MODEL,STAFF_ID) VALUES ('SA-38 ', 'A002')
INSERT INTO trip_ora.ASSIGN (CAR_MODEL,STAFF_ID) VALUES ('SA-38 ', 'D002')
INSERT INTO trip_ora.ASSIGN (CAR_MODEL,STAFF_ID) VALUES ('WZ-01 ', 'B002')
INSERT INTO trip_ora.ASSIGN (CAR_MODEL,STAFF_ID) VALUES ('WZ-01 ', 'B003')
INSERT INTO trip_ora.ASSIGN (CAR_MODEL,STAFF_ID) VALUES ('WZ-01 ', 'C002')
2008/2/19 18
Data insertion for target relational database Trip (continue)
UPDATE trip_ora52.TRIP
SET DEPT_ID= 'AA001'
, CAR_MODEL= 'MZ-18 '
, STAFF_ID= 'A002'
WHERE TRIP_ID='T0001'

UPDATE trip_ora52.TRIP
SET DEPT_ID= 'AA001'
, CAR_MODEL= 'MZ-18 '
, STAFF_ID= 'B001'
WHERE TRIP_ID='T0002'

UPDATE trip_ora52.TRIP
SET DEPT_ID= 'AB001'
, CAR_MODEL= 'SA-38 '
, STAFF_ID= 'A002'
WHERE TRIP_ID='T0003'

UPDATE trip_ora52.TRIP
SET DEPT_ID= 'BB001'
, CAR_MODEL= 'SA-38 '
, STAFF_ID= 'D002'
WHERE TRIP_ID='T0004'

2008/2/19 19
Data Integration
Step 1: Merge by Union
Relation Ra          Relation Rb              Relation Rx
A1   A2              A1   A3                  A1   A2    A3
a11  a21             a13  a31       ==>       a11  a21   null
a12  a22             a14  a32                 a12  a22   null
                                              a13  null  a31
                                              a14  null  a32

2008/2/19 20
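A minimal runnable sketch of this union merge, using Python's sqlite3 with the relations from the slide (the SQL shape is one straightforward way to realize the step, not a prescribed implementation):

import sqlite3

# Merge by union: pad each source relation with NULLs for the attributes
# it lacks, then take the union of the two padded relations.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Ra (A1, A2);  INSERT INTO Ra VALUES ('a11','a21'), ('a12','a22');
CREATE TABLE Rb (A1, A3);  INSERT INTO Rb VALUES ('a13','a31'), ('a14','a32');
CREATE TABLE Rx AS
    SELECT A1, A2, NULL AS A3 FROM Ra
    UNION
    SELECT A1, NULL AS A2, A3 FROM Rb;
""")
print(con.execute("SELECT * FROM Rx ORDER BY A1").fetchall())
# [('a11', 'a21', None), ('a12', 'a22', None),
#  ('a13', None, 'a31'), ('a14', None, 'a32')]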
Step 2: Merge classes by
generalization
Relation Ra              Relation Rx1             Relation Rx
A1   A2                  A1   A2                  A1   A2    A3
a11  a21                 a11  a21                 a11  a21   null
a12  a22                 a12  a22        ==>      a12  a22   null
                                                  a13  null  a31
Relation Rb              Relation Rx2             a14  null  a32
A1   A3                  A1   A3
a13  a31                 a13  a31
a14  a32                 a14  a32

2008/2/19 21
Step 3: Merge classes by
inheritance
Relation Rb              Relation Rxb
A1   A2                  A1   A2    A3
a11  a21                 a11  a21   a31
a12  a22        ==>      a12  a22   null

Relation Ra              Relation Rxa
A1   A3                  A1   A3
a11  a31                 a11  a31

2008/2/19 22
Step 4 Merge classes by
aggregation
Relation Ra              Relation Rx1
A1   A2                  A1   A2
a11  a21                 a11  a21
a12  a22                 a12  a22

Relation Rx              Relation Rx'
A1   A3   A5     ==>     A1   A3   *A5
a11  a31  a51            a11  a31  a51
a12  a32  a51            a12  a32  a51

Relation Rb / Ry         Relation Rx2 / Ry
A3   A4   A5   A6        A3   A4   A5   A6
a31  a41  a51  a61       a31  a41  a51  a61
a32  a42  a52  a62       a32  a42  a52  a62

2008/2/19 23
Step 5 Merge classes by
categorization
Relation Ra              Relation Ra              Relation Rx
A1   A2                  A1   A2                  A1   A4
a11  a21                 a11  a21                 a11  a41
a12  a22                 a12  a22                 a12  a42
                ==>                               a13  a43
Relation Rb              Relation Rb              a14  a44
A1   A3                  A1   A3
a13  a31                 a13  a31
a14  a32                 a14  a32

2008/2/19 24
Step 6 Merge classes by implied
relationship

Relation Ra          Relation Rb          Relation Xa          Relation Xb
A1   A2   A3         A1          ==>      A1   A2   A3         *A1
a11  a21  a31        a11                  a11  a21  a31        a11
a12  a22  a32        a12                  a12  a22  a32        a12

2008/2/19 25
Step 7 Merge relationship by
subtype
Relation Ra          Relation Rb          Relation Xa          Relation Xc
A1   A2              A3   *A1             A1   A2              *A3  *A1
a11  a21             a31  a11    ==>      a11  a21             a31  a11
a12  a22             a32  a12             a12  a22             a32  a12

Relation Ra'         Relation Rb'         Relation Xb
A1   A2              A3   A1              A3   A1
a11  a21             a31  a11             a31  a11
a12  a22             a33  null            a32  a12
                                          a33  null
2008/2/19 26
Data conversion from relational into XML

Relational Schema --Step 1: Reverse Engineering--> EER Model
  --Step 2: Schema Translation--> XML DTD & DTD Graph

Relational Database --Step 3: Data Conversion--> XML Document

Architecture of Re-engineering Relational Database into XML Document


2008/2/19 27
Methodology of converting RDB into XML

As the result of the schema translation, we
translate an EER model into different views of
XML schemas based on their selected root
elements. For each translated XML schema, we
can read its corresponding source relation
sequentially by embedded SQL, starting from a parent
relation. Each tuple can then be loaded into the XML
document according to the mapped XML DTD.
Then we read the corresponding child relation
tuple(s), and load them into the XML document.
2008/2/19 28
Algorithm of transforming RDB into XML
begin
  while not end of element do
    read an element from the translated target DTD;
    read the tuple of a corresponding relation of the element from the source relational database;
    load this tuple into a target XML document;
    read the child elements of the element according to the DTD;
    while not at end of the corresponding child relation in the source relational database do
      read the tuple from the child relation such that the child corresponds to the
      processed parent relation's tuple;
      load the tuple into the target XML document;
    end loop // end inner loop
  end loop // end outer loop
end

2008/2/19 29
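The following is a small runnable sketch of this parent/child loop in Python with sqlite3, using the generic one-to-many pair Relation A(A1, A2) / Relation B(B1, B2, *A1) that appears in the later slides; the table names and the use of sqlite3 are illustrative assumptions:

import sqlite3
from xml.sax.saxutils import quoteattr

# Sketch: emit each parent tuple as an element, then nest its matching
# child tuples, following the two loops of the algorithm above.
def relation_to_xml(con):
    out = []
    for a1, a2 in con.execute("SELECT A1, A2 FROM A"):
        out.append(f"<A A1={quoteattr(a1)} A2={quoteattr(a2)}>")
        for b1, b2 in con.execute("SELECT B1, B2 FROM B WHERE A1 = ?", (a1,)):
            out.append(f"  <B B1={quoteattr(b1)} B2={quoteattr(b2)}></B>")
        out.append("</A>")
    return "\n".join(out)

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE A (A1, A2);  INSERT INTO A VALUES ('a11','a21'), ('a12','a22');
CREATE TABLE B (B1, B2, A1);
INSERT INTO B VALUES ('b11','b21','a11'), ('b12','b22','a12');
""")
print(relation_to_xml(con))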
Step 1: Reverse Engineering Relational
Schema into an EER Model

By use of classification tables to define the


relationship between keys and attributes in
all relations, we can recover their data
semantics in the form of an EER model.

2008/2/19 30
Step 2: Data Conversion from Relational into
XML document

We can map the data semantics in the EER
model into a DTD graph according to their
data dependency constraints. These
constraints can then be transformed into a
DTD as the XML schema.

2008/2/19 31
Step 2.1 Defining a Root Element
To select a root element, we must put its relevant
information into an XML schema. Relevance
concerns the entities that are related to an
entity selected by the user. The relevant classes
include the selected entity and all its relevant entities
that are navigable.

In an EER model, we can navigate from one entity to
another in correspondence with the XML hierarchical
containment tree model.

2008/2/19 32
(EER model: the selected entity EntityA is related through relationships
R1-R7 to EntityB ... EntityH; its navigable relevant entities are mapped to
an XML view whose root element E contains elements A, B, C, D, F, G and H,
with * marking elements of multiple occurrence.)

EER Model -- Mapping --> Mapped XML View

2008/2/19 33
Step 2.2 Mapping Cardinality
from RDB to XML

In a DTD, we translate one-to-one cardinality into
parent and child elements, and one-to-many
cardinality into parent and child elements with
multiple occurrences. Many-to-many cardinality
is mapped into a DTD hierarchy structure with
ID and IDREF attributes.

2008/2/19 34
One-to-one cardinality
EER Model: Entity A (A1, A2) 1 -- 1 Entity B (B1, B2)
   -- Schema Translation -->
DTD Graph: Element A (A1, A2) contains Element B (B1, B2)

Relational Schema                    DTD

Relation A(A1, A2)                   <!ELEMENT A (B)>
Relation B(B1, B2, *A1)              <!ATTLIST A A1 CDATA #REQUIRED>
                                     <!ATTLIST A A2 CDATA #REQUIRED>
                                     <!ELEMENT B EMPTY>
                                     <!ATTLIST B B1 CDATA #REQUIRED>
                                     <!ATTLIST B B2 CDATA #REQUIRED>

2008/2/19 35
One-to-one cardinality

Relation A                           XML Document
A1   A2
a11  a21                             <A A1="a11" A2="a21">
a12  a22                               <B B1="b11" B2="b21"></B>
                                     </A>
Relation B           Data
B1   B2   *A1        Conversion      <A A1="a12" A2="a22">
b11  b21  a11                          <B B1="b12" B2="b22"></B>
b12  b22  a12                        </A>

2008/2/19 36
One-to-many cardinality
EER Model: Entity A (A1, A2) 1 -- R -- n Entity B (B1, B2)
   -- Schema Translation -->
DTD Graph: Element A (A1, A2) contains Element B (B1, B2) with multiple occurrence (*)

Relational Schema DTD

Relation A(A1, A2)                   <!ELEMENT A (B*)>
Relation B(B1, B2, *A1)              <!ATTLIST A A1 CDATA #REQUIRED>
                                     <!ATTLIST A A2 CDATA #REQUIRED>
                                     <!ELEMENT B EMPTY>
                                     <!ATTLIST B B1 CDATA #REQUIRED>
                                     <!ATTLIST B B2 CDATA #REQUIRED>

2008/2/19 37
One-to-many cardinality

Relation A                           XML Document
A1   A2
a11  a21                             <A A1="a11" A2="a21">
a12  a22                               <B B1="b11" B2="b21"></B>
                                     </A>
Relation B           Data
B1   B2   *A1        Conversion      <A A1="a12" A2="a22">
b11  b21  a11                          <B B1="b12" B2="b22"></B>
b12  b22  a12                          <B B1="b13" B2="b23"></B>
b13  b23  a12                        </A>

2008/2/19 38
Many-to-many cardinality
EER Model: Entity A (A1, A2) m -- R -- n Entity B (B1, B2)
   -- Schema Translation -->
DTD Graph: Element A (A1, A2, A_id), Element R (A_idref, B_idref),
Element B (B1, B2, B_id)

Relational Schema                    DTD

Relation A(A1, A2)                   <!ELEMENT A EMPTY>
Relation B(B1, B2)                   <!ATTLIST A A1 CDATA #REQUIRED>
Relation R(*A1, *B1)                 <!ATTLIST A A2 CDATA #REQUIRED>
                                     <!ATTLIST A A_id ID #REQUIRED>
                                     <!ELEMENT R EMPTY>
                                     <!ATTLIST R A_idref IDREF #REQUIRED>
                                     <!ATTLIST R B_idref IDREF #REQUIRED>
                                     <!ELEMENT B EMPTY>
                                     <!ATTLIST B B1 CDATA #REQUIRED>
                                     <!ATTLIST B B2 CDATA #REQUIRED>
                                     <!ATTLIST B B_id ID #REQUIRED>

2008/2/19 39
Many-to-many cardinality

Relation A                  XML Document
A1   A2
a11  a21                    <A A1="a11" A2="a21" A_id="1"></A>
a12  a22                    <B B1="b11" B2="b21" B_id="2"></B>
                            <R A_idref="1" B_idref="2"></R>
Relation B
B1   B2          Data       <A A1="a12" A2="a22" A_id="3"></A>
b11  b21         Conversion <B B1="b12" B2="b22" B_id="4"></B>
b12  b22                    <R A_idref="3" B_idref="4"></R>

Relation R                  <R A_idref="1" B_idref="4"></R>
*A1  *B1
a11  b11
a12  b12
a11  b12
2008/2/19 40
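A small sketch of how the ID/IDREF numbering above can be generated (plain Python; the in-memory row format and the function name are assumptions made for illustration):

# Assign each A and B tuple a unique ID, then emit one <R> element per pair
# in R that references the two IDs through IDREF attributes.
def mn_to_xml(a_rows, b_rows, r_pairs):
    ids, out, next_id = {}, [], 1
    for (a1, a2) in a_rows:
        ids[('A', a1)] = next_id
        out.append(f'<A A1="{a1}" A2="{a2}" A_id="{next_id}"></A>')
        next_id += 1
    for (b1, b2) in b_rows:
        ids[('B', b1)] = next_id
        out.append(f'<B B1="{b1}" B2="{b2}" B_id="{next_id}"></B>')
        next_id += 1
    for (a1, b1) in r_pairs:
        out.append(f'<R A_idref="{ids[("A", a1)]}" '
                   f'B_idref="{ids[("B", b1)]}"></R>')
    return "\n".join(out)

print(mn_to_xml([("a11", "a21")], [("b11", "b21")], [("a11", "b11")]))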
Case Study
Consider a case study of a Hospital Database
System. In this system, a patient can have
many record folders. Each record folder can
contain many different medical records of the
patient. A country has many patients. Once a
record folder is borrowed, a loan history is
created to record the details about it.

2008/2/19 41
Hospital Relational Schema
Relation Patient (HK_ID, Patient_Name)
Relation Record_Folder (Folder_No, Location, *HKID)
Relation Medical_Record (Medical_Rec_No,
Create_Date, Sub_Type, *Folder_No)
Relation Borrower (*Borrower_No, Borrower_Name)
Relation Borrow (*Borrower_No, *Folder_No)

where underlined attributes are primary keys and attributes
prefixed with “*” are foreign keys
2008/3/1 42
Relational Data

Patient table
HK_ID      Patient name
E3766849   Smith

Record_Folder table
Folder no   Location          *HKID
F_21        Hong Kong         E3766849
F_24        New Territories   E3766849

Borrow table
Folder no   Borrower no
F_21        B1
F_21        B11
F_21        B21
F_21        B22
F_24        B22

2008/3/1 43
Borrower table
Borrower_no   Borrower_name
B1            Johnson
B11           Choy
B21           Fung
B22           Lok

Medical_Record table
Medical_Rec_no   Create_Date   Sub_type   Folder_no
M_311999         Jan-01-1999   W          F_21
M_322000         Nov-12-1998   W          F_21
M_352001         Jan-15-2001   A          F_21
M_362001         Feb-01-2001   A          F_21
M_333333         Mar-03-2001   A          F_24

2008/2/19 44
Step 1 Reverse engineer relational database into an EER model

Patient 1 -- belong -- n Record Folder
Record Folder 1 -- contain -- n Medical Record
Record Folder m -- Borrow (by/has) -- n Borrower

2008/2/19 45
Step 2 Translate EER model into
DTD Graph and DTD
In this case study, suppose we are concerned with the
patient medical records, so the entity Patient is
selected. Then we define a meaningful name
for the root element, called Patient_Records.
We start from the entity Patient in the EER
model and then find the relevant entities for it.
The relevant entities include the related entities
that are navigable from the parent entity.

2008/2/19 46
Translated Document Type Definition Graph

Patient_Records
  Patient
    Record_Folder
      Borrow *  --  Borrower
      Medical_Record *

2008/2/19 47
Translated Document Type Definition

<!ELEMENT Patient_Records (Patient+)>

<!ELEMENT Patient (Record_Folder*)>


<!ATTLIST Patient HKID CDATA #REQUIRED>
<!ATTLIST Patient Patient_Name CDATA #REQUIRED>

<!ELEMENT Record_Folder (Borrow*, Medical_Record*)>


<!ATTLIST Record_Folder Folder_No CDATA #REQUIRED>
<!ATTLIST Record_Folder Location CDATA #REQUIRED>

<!ELEMENT Borrow (Borrower)>


<!ATTLIST Borrow Borrower_No CDATA #REQUIRED>

<!ELEMENT Medical_Record EMPTY>


<!ATTLIST Medical_Record Medical_Rec_No CDATA #REQUIRED>
<!ATTLIST Medical_Record Create_Date CDATA #REQUIRED>
<!ATTLIST Medical_Record Sub_Type CDATA #REQUIRED>

<!ELEMENT Borrower EMPTY>


<!ATTLIST Borrower Borrower_name CDATA #REQUIRED>

2008/3/1 48
Transformed XML document
<Patient_Records>
<Patient Country_No="C0001" HKID="E3766849" Patient_Name="Smith">
<Record_Folder Folder_No="F_21" Location="Hong Kong">
<Borrow Borrower_No="B1">
<Borrower Borrower_name="Johnson" />
</Borrow>
<Borrow Borrower_No="B11">
<Borrower Borrower_name="Choy" />
</Borrow>
<Borrow Borrower_No="B21">
<Borrower Borrower_name="Fung" />
</Borrow>
<Borrow Borrower_No="B22">
<Borrower Borrower_name="Lok" />
</Borrow>
<Medical_Record Medical_Rec_No="M_311999" Create_Date="Jan-01-1999" Sub_Type="W"></Medical_Record>
<Medical_Record Medical_Rec_No="M_322000" Create_Date="Nov-12-1998" Sub_Type="W"></Medical_Record>
<Medical_Record Medical_Rec_No="M_352001" Create_Date="Jan-15-2001" Sub_Type="A"></Medical_Record>
<Medical_Record Medical_Rec_No="M_362001" Create_Date="Feb-01-2001" Sub_Type="A"></Medical_Record>
</Record_Folder>
<Record_Folder Folder_No="F_24" Location="New Territories">
<Borrow Borrower_No="B22">
<Borrower Borrower_name="Lok" />
</Borrow>
<Medical_Record Medical_Rec_No="M_333333" Create_Date="Mar-03-2001" Sub_Type="A"></Medical_Record>
</Record_Folder>
</Patient>
</Patient_Records>

2008/2/19 49
Reading Assignment
Chapter 4 Data Conversion of “Information
Systems Reengineering and Integration”
by Joseph Fong, Springer Verlag, pp.160-
198.

2008/2/19 50
Lecture review question 6
How do you compare the pros and cons of
using “Logical Level Translation Approach”
with “Customized Program Approach” in
data conversion?

2008/2/19 51
CS5483 Tutorial Question 6
Convert the following relational database into an XML document:
Relation Car_rental
Car_model Staff_ID *Trip_ID
MZ-18 A002 T0001
MZ-18 B001 T0002
R-023 B004 T0001
R-023 C001 T0004
SA-38 A001 T0003
SA-38 A002 T0001

Relation Trip
Trip_ID *Department_ID
T0001 AA001
T0002 AA001
T0003 AB001
T0004 BA001

Relation Department
Department_ID Salary
AA001 35670
AB001 30010
BA001 22500
2008/3/1 52
Fall 2004, CIS, Temple University

CIS527: Data Warehousing, Filtering, and


Mining

Lecture 2

◼ Data Warehousing and OLAP Technology for Data Mining

Lecture slides taken/modified from:


◼ Jiawei Han (http://www-sal.cs.uiuc.edu/~hanj/DM_Book.html)

1
Chapter 2: Data Warehousing and
OLAP Technology for Data Mining

◼ What is a data warehouse?

◼ A multi-dimensional data model

◼ Data warehouse architecture

◼ Data warehouse implementation

◼ Further development of data cube technology

◼ From data warehousing to data mining

2
What is Data Warehouse?

◼ Defined in many different ways, but not rigorously.


◼ A decision support database that is maintained
separately from the organization’s operational
database
◼ Support information processing by providing a solid
platform of consolidated, historical data for analysis.
◼ “A data warehouse is a subject-oriented, integrated,
time-variant, and nonvolatile (i.e., never lost) collection of data
in support of management’s decision-making process.”—
W. H. Inmon
◼ Data warehousing:
◼ The process of constructing and using data
warehouses
3
Data Warehouse—Subject-Oriented

◼ Provide a simple and concise view around


particular subject issues by excluding data that are not
useful in the decision support process.
◼ Organized around major subjects, such as customer,
product, sales.
◼ Focusing on the modeling and analysis of data for
decision makers, not on daily operations or transaction
processing.

4
Data Warehouse—Integrated
◼ Constructed by integrating multiple, heterogeneous
data sources
◼ relational databases, flat files, on-line transaction
records
◼ Data cleaning and data integration techniques are
applied.
◼ Ensure consistency in naming conventions, encoding
structures, attribute measures, etc. among different
data sources
◼ E.g., Hotel price: currency, tax, breakfast covered, etc.
◼ When data is moved to the warehouse, it is
converted.

5
Data Warehouse—Time Variant

◼ The time horizon for the data warehouse is significantly


longer than that of operational systems.
◼ Operational database: current value data.
◼ Data warehouse data: provide information from a
historical perspective (e.g., past 5-10 years)
◼ Every key structure in the data warehouse
◼ Contains an element of time, explicitly or implicitly
◼ However, the key of operational data may or may not
contain “time element”.

6
Data Warehouse—Non-Volatile

◼ A physically separate store of data transformed from the


operational environment.
◼ Operational update of data does not necessarily occur in
the data warehouse environment.
◼ Does not require transaction processing, recovery,
and concurrency control mechanisms
◼ Often requires only two operations in data accessing:
◼ initial loading of data and access of data.

7
Data Warehouse vs. Heterogeneous DBMS

◼ Traditional heterogeneous DB integration:


◼ Build wrappers/mediators on top of heterogeneous databases
◼ Query driven approach
◼ A query posed to a client site is translated into queries
appropriate for individual heterogeneous sites; The results are
integrated into a global answer set
◼ Involving complex information filtering
◼ Competition for resources at local sources
◼ Data warehouse: update-driven, high performance
◼ Information from heterogeneous sources is integrated in advance
and stored in warehouses for direct query and analysis

8
Data Warehouse vs. Operational DBMS
◼ OLTP (on-line transaction processing)
◼ Major task of traditional relational DBMS
◼ Day-to-day operations: purchasing, inventory, banking,
manufacturing, payroll, registration, accounting, etc.
◼ OLAP (on-line analytical processing)
◼ Major task of data warehouse system
◼ Data analysis and decision making
◼ Distinct features (OLTP vs. OLAP):
◼ User and system orientation: customer vs. market
◼ Data contents: current, detailed vs. historical, consolidated
◼ Database design: ER + application vs. star + subject
◼ View: current, local vs. evolutionary, integrated
◼ Access patterns: update vs. read-only but complex queries
9
Why Separate Data Warehouse?
◼ High performance for both systems
◼ DBMS— tuned for OLTP: access methods, indexing,
concurrency control, recovery
◼ Warehouse—tuned for OLAP: complex OLAP queries,
multidimensional view, consolidation.
◼ Different functions and different data:
◼ Decision support requires historical data which
operational DBs do not typically maintain
◼ Decision Support requires consolidation (aggregation,
summarization) of data from heterogeneous sources
◼ Different sources typically use inconsistent data
representations, codes and formats which have to be
reconciled
10
Chapter 2: Data Warehousing and
OLAP Technology for Data Mining

◼ What is a data warehouse?

◼ A multi-dimensional data model

◼ Data warehouse architecture

◼ Data warehouse implementation

◼ Further development of data cube technology

◼ From data warehousing to data mining

11
A Multi-Dimensional Data Model

◼ A data warehouse is based on a multidimensional data model which


views data in the form of a data cube
◼ A data cube allows data to be modeled and viewed in multiple
dimensions
◼ Dimension tables, such as item (item_name, brand, type), or
time(day, week, month, quarter, year)
◼ Fact table contains measures (such as dollars_sold) and keys to
each of the related dimension tables
◼ In data warehousing literature, an n-D base cube is called a base
cuboid. The top most 0-D cuboid, which holds the highest-level of
summarization, is called the apex cuboid. The lattice of cuboids
forms a data cube.
12
A Sample Data Cube
(3-D data cube: dimensions Time (1Qtr ... 4Qtr, sum), Product (TV, PC, VCR,
sum) and Country (U.S.A, Canada, Mexico, sum); e.g., one cell holds the
total annual sales of TVs in the U.S.A.)

13
4-D Data Cube

(The 3-D cube of the previous slide, repeated for Supplier 1, Supplier 2
and Supplier 3, forms a 4-D data cube.)

14
Cube: A Lattice of Cuboids

0-D (apex) cuboid:   all
1-D cuboids:         time, item, location, supplier
2-D cuboids:         (time,item), (time,location), (item,location),
                     (location,supplier), (time,supplier), (item,supplier)
3-D cuboids:         (time,item,location), (time,location,supplier),
                     (time,item,supplier), (item,location,supplier)
4-D (base) cuboid:   (time, item, location, supplier)
15
Conceptual Modeling of Data Warehouses

◼ Modeling data warehouses: dimensions & measures


◼ Star schema: A fact table in the middle connected to a
set of dimension tables
◼ Snowflake schema: A refinement of star schema
where some dimensional hierarchy is normalized into a
set of smaller dimension tables, forming a shape
similar to snowflake
◼ Fact constellations: Multiple fact tables share
dimension tables, viewed as a collection of stars,
therefore called galaxy schema or fact constellation
16
Example of Star Schema
time dimension: time_key, day, day_of_the_week, month, quarter, year
item dimension: item_key, item_name, brand, type, supplier_type
branch dimension: branch_key, branch_name, branch_type
location dimension: location_key, street, city, province_or_state, country

Sales Fact Table: time_key, item_key, branch_key, location_key
Measures: units_sold, dollars_sold, avg_sales
17
Example of Snowflake Schema

time dimension: time_key, day, day_of_the_week, month, quarter, year
item dimension: item_key, item_name, brand, type, supplier_key
   supplier dimension: supplier_key, supplier_type
branch dimension: branch_key, branch_name, branch_type
location dimension: location_key, street, city_key
   city dimension: city_key, city, province, country

Sales Fact Table: time_key, item_key, branch_key, location_key
Measures: units_sold, dollars_sold, avg_sales
18
Example of Fact Constellation
time dimension: time_key, day, day_of_the_week, month, quarter, year
item dimension: item_key, item_name, brand, type, supplier_type
branch dimension: branch_key, branch_name, branch_type
location dimension: location_key, street, city, province_or_state, country
shipper dimension: shipper_key, shipper_name, location_key, shipper_type

Sales Fact Table: time_key, item_key, branch_key, location_key
Measures: units_sold, dollars_sold, avg_sales

Shipping Fact Table: time_key, item_key, shipper_key, from_location, to_location
Measures: dollars_cost, units_shipped
19
A Data Mining Query Language,
DMQL: Language Primitives
◼ Cube Definition (Fact Table)
define cube <cube_name> [<dimension_list>]:
<measure_list>
◼ Dimension Definition (Dimension Table)
define dimension <dimension_name> as
(<attribute_or_subdimension_list>)
◼ Special Case (Shared Dimension Tables)
◼ First time as “cube definition”

◼ define dimension <dimension_name> as

<dimension_name_first_time> in cube
<cube_name_first_time>

20
Defining a Star Schema in DMQL

define cube sales_star [time, item, branch, location]:


dollars_sold = sum(sales_in_dollars), avg_sales =
avg(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week,
month, quarter, year)
define dimension item as (item_key, item_name, brand,
type, supplier_type)
define dimension branch as (branch_key, branch_name,
branch_type)
define dimension location as (location_key, street, city,
province_or_state, country)

21
Defining a Snowflake Schema in DMQL

define cube sales_snowflake [time, item, branch, location]:


dollars_sold = sum(sales_in_dollars), avg_sales =
avg(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week,
month, quarter, year)
define dimension item as (item_key, item_name, brand, type,
supplier(supplier_key, supplier_type))
define dimension branch as (branch_key, branch_name,
branch_type)
define dimension location as (location_key, street,
city(city_key, province_or_state, country))
22
Defining a Fact Constellation in DMQL
define cube sales [time, item, branch, location]:
dollars_sold = sum(sales_in_dollars), avg_sales =
avg(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier_type)
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city, province_or_state,
country)
define cube shipping [time, item, shipper, from_location, to_location]:
dollar_cost = sum(cost_in_dollars), unit_shipped = count(*)
define dimension time as time in cube sales
define dimension item as item in cube sales
define dimension shipper as (shipper_key, shipper_name, location as location
in cube sales, shipper_type)
define dimension from_location as location in cube sales
define dimension to_location as location in cube sales

23
Measures: Three Categories
Measure: a function evaluated on aggregated data
corresponding to given dimension-value pairs.
Measures can be:
◼ distributive: if the measure can be calculated in a
distributive manner.
◼ E.g., count(), sum(), min(), max().
◼ algebraic: if it can be computed from arguments obtained
by applying distributive aggregate functions.
◼ E.g., avg()=sum()/count(), min_N(), standard_deviation().
◼ holistic: if it is not algebraic.
◼ E.g., median(), mode(), rank().

24
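A small sketch of why these categories matter for cubes (illustrative Python, not from the slides): partial aggregates kept in lower-level cells can be combined into higher-level cells without revisiting the raw data.

# Each lower-level cell keeps (sum, count); a higher-level cell is obtained
# by combining these partial aggregates -- no rescan of base tuples needed.
cells = [(150.0, 3), (90.0, 2), (260.0, 5)]   # (sum, count) per child cell

total_sum = sum(s for s, _ in cells)          # sum() is distributive
total_cnt = sum(c for _, c in cells)          # count() is distributive
avg = total_sum / total_cnt                   # avg() is algebraic: sum/count
print(avg)                                    # 50.0

# median() is holistic: it cannot be combined from the children's medians;
# it needs the individual values (or an approximation) to be kept.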
Measures: Three Categories

◼ Distributive and algebraic


measures are ideal for data
cubes.
◼ Calculated measures at lower
levels can be used directly at
higher levels.
◼ Holistic measures can be
difficult to calculate efficiently.
◼ Holistic measures could often
be efficiently approximated.

25
Browsing a Data Cube

◼ Visualization
◼ OLAP capabilities
◼ Interactive manipulation
26
A Concept Hierarchy
• Concept hierarchies allow data to be handled
at varying levels of abstraction

Dimensions: Product, Location, Time

Hierarchical summarization paths:
  Product:  Industry → Category → Product
  Location: Region → Country → City → Office
  Time:     Year → Quarter → Month → Day, with Week → Day as an
            alternative path

27
Typical OLAP Operations (Fig 2.10)
◼ Roll up (drill-up): summarize data
◼ by climbing up concept hierarchy or by dimension reduction
◼ Drill down (roll down): reverse of roll-up
◼ from higher level summary to lower level summary or detailed
data, or introducing new dimensions
◼ Slice and dice:
◼ project and select
◼ Pivot (rotate):
◼ reorient the cube, visualization, 3D to series of 2D planes.
◼ Other operations
◼ drill across: involving (across) more than one fact table
◼ drill through: through the bottom level of the cube to its back-
end relational tables (using SQL)
28
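As a toy illustration of roll-up and slice (plain Python, not from the slides): a cube stored as a dict keyed by (item, city, quarter), where roll-up aggregates a dimension away and slice selects one value of a dimension.

from collections import defaultdict

cube = {("TV", "Chicago", "Q1"): 854.0, ("TV", "New York", "Q1"): 1087.0,
        ("PC", "Chicago", "Q1"): 882.0, ("TV", "Chicago", "Q2"): 943.0}

def roll_up(cube, keep):          # aggregate away the dimensions not kept
    out = defaultdict(float)
    for dims, v in cube.items():
        out[tuple(d for d, k in zip(dims, keep) if k)] += v
    return dict(out)

def slice_(cube, qtr):            # select one value on the time dimension
    return {k: v for k, v in cube.items() if k[2] == qtr}

print(roll_up(cube, (True, False, True)))   # roll up city: by (item, quarter)
print(slice_(cube, "Q1"))                   # slice: only first-quarter cells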
Querying Using a Star-Net Model
Customer Orders
Shipping Method
Customer
CONTRACTS
AIR-EXPRESS
Each circle is
ORDER called a footprint
TRUCK
PRODUCT LINE
Time Product
ANNUALY QTRLY DAILY PRODUCT ITEM PRODUCT GROUP
CITY
SALES PERSON
COUNTRY
DISTRICT

REGION
DIVISION
Location
Promotion Organization
29
Chapter 2: Data Warehousing and
OLAP Technology for Data Mining

◼ What is a data warehouse?

◼ A multi-dimensional data model

◼ Data warehouse architecture

◼ Data warehouse implementation

◼ Further development of data cube technology

◼ From data warehousing to data mining

30
Data Warehouse Design Process

◼ Top-down, bottom-up approaches or a combination of both


◼ Top-down: Starts with overall design and planning (mature)

◼ Bottom-up: Starts with experiments and prototypes (rapid)

◼ From software engineering point of view


◼ Waterfall: structured and systematic analysis at each step before
proceeding to the next
◼ Spiral: rapid generation of increasingly functional systems, quick
modifications, timely adaptation of new designs and technologies
◼ Typical data warehouse design process
◼ Choose a business process to model, e.g., orders, invoices, etc.

◼ Choose the grain (atomic level of data) of the business process

◼ Choose the dimensions that will apply to each fact table record

◼ Choose the measure that will populate each fact table record

31
Multi-Tiered Architecture

Data Sources (operational DBs, other sources)
   -- Extract, Transform, Load, Refresh -->
Data Storage (data warehouse + data marts, with Monitor & Integrator and Metadata)
   -- Serve -->
OLAP Engine (OLAP server)  -->  Front-End Tools (analysis, query, reports, data mining)


32
Three Data Warehouse Models
◼ Enterprise warehouse
◼ collects all of the information about subjects spanning

the entire organization


◼ Data Mart
◼ a subset of corporate-wide data that is of value to a

specific groups of users. Its scope is confined to


specific, selected groups, such as marketing data mart
◼ Independent vs. dependent (directly from warehouse) data mart
◼ Virtual warehouse
◼ A set of views over operational databases

◼ Only some of the possible summary views may be

materialized
33
OLAP Server Architectures
◼ Relational OLAP (ROLAP)
◼ Use relational or extended-relational DBMS to store and

manage warehouse data


◼ Include optimization of DBMS backend and additional tools

and services
◼ greater scalability

◼ Multidimensional OLAP (MOLAP)


◼ Array-based multidimensional storage engine (sparse

matrix techniques)
◼ fast indexing to pre-computed summarized data

◼ Hybrid OLAP (HOLAP)


◼ User flexibility (low level: relational, high-level: array)

◼ Specialized SQL servers


◼ specialized support for SQL queries over star/snowflake

schemas
34
Chapter 2: Data Warehousing and
OLAP Technology for Data Mining

◼ What is a data warehouse?

◼ A multi-dimensional data model

◼ Data warehouse architecture

◼ Data warehouse implementation

◼ Further development of data cube technology

◼ From data warehousing to data mining

35
Efficient Data Cube Computation

◼ Data cube can be viewed as a lattice of cuboids


◼ The bottom-most cuboid is the base cuboid
◼ The top-most cuboid (apex) contains only one cell
◼ How many cuboids are there in an n-dimensional cube where dimension i
has L_i hierarchy levels?

   T = Π_{i=1..n} (L_i + 1)

◼ Materialization of data cube
◼ Materialize every cuboid (full materialization), none
(no materialization), or some (partial materialization)
◼ Selection of which cuboids to materialize
◼ Based on size, sharing, access frequency, etc.
36
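A quick check of the cuboid-count formula in Python (the 4-dimension, 3-level figures are an illustrative assumption):

from math import prod

# T = product over dimensions of (Li + 1); the +1 accounts for the "all" level.
levels = [3, 3, 3, 3]                 # Li for each of n = 4 dimensions
print(prod(L + 1 for L in levels))    # 4**4 = 256 cuboids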
Cube Operation
◼ Cube definition and computation in DMQL
   define cube sales[item, city, year]: sum(sales_in_dollars)
   compute cube sales
◼ Transform it into a SQL-like language (with a new operator cube by,
introduced by Gray et al.’96)
   SELECT item, city, year, SUM (amount)
   FROM SALES
   CUBE BY item, city, year
◼ Need to compute the following Group-Bys:
   (item, city, year), (item, city), (item, year), (city, year),
   (item), (city), (year), ()
   i.e., the whole lattice of group-bys from the apex () down to the
   base (city, item, year)
37
Cube Computation: ROLAP vs. MOLAP

◼ ROLAP-based cubing algorithms


◼ Key-based addressing
◼ Sorting, hashing, and grouping operations are applied to the
dimension attributes to reorder and cluster related tuples
◼ Aggregates may be computed from previously computed
aggregates, rather than from the base fact table
◼ MOLAP-based cubing algorithms
◼ Direct array addressing
◼ Partition the array into chunks that fit the memory
◼ Compute aggregates by visiting cube chunks
◼ Possible to exploit ordering of chunks for faster calculation

38
Multiway Array Aggregation for MOLAP
◼ Partition arrays into chunks (a small subcube which fits in memory).
◼ Compressed sparse array addressing: (chunk_id, offset)
◼ Compute aggregates in “multiway” by visiting cube cells in the order
which minimizes the # of times to visit each cell, and reduces
memory access and storage cost.
(Figure: a 4×4×4 cube with dimensions A (a0–a3), B (b0–b3), C (c0–c3),
partitioned into 64 chunks numbered 1–64.)
What is the best traversing order to do multi-way aggregation?
39
Multiway Array Aggregation for MOLAP

(Same chunked cube.)
After scanning chunks {1,2,3,4}:
• the b0c0 chunk (of the BC plane) is computed
• a0c0 and a0b0 are not yet computed

40
Multiway Array Aggregation for MOLAP
Memory needed while scanning in chunk order 1–64:
• a single b-c chunk of the BC plane
• 4 a-c chunks of the AC plane
• 16 a-b chunks of the AB plane

After scanning chunks 1–13:
• the a0c0 and b0c0 chunks are computed
• a0b0 is not computed (we would need to scan chunks 1–49)

41
Multiway Array Aggregation for MOLAP

◼ Method: the planes should be sorted and computed


according to their size in ascending order.
◼ The proposed scan is optimal if |C|>|B|>|A|

◼ See the details of Example 2.12 (pp. 75-78)

◼ MOLAP cube computation is faster than ROLAP


◼ Limitation of MOLAP: computing well only for a small
number of dimensions
◼ If there are a large number of dimensions use the
iceberg cube computation: process only “dense” chunks

42
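A stripped-down sketch of the multiway idea in Python (assumptions: a dense nested-list cube and no chunking; the point is only that a single scan of the cells feeds all three 2-D aggregate planes at once):

# Compute the AB, AC and BC planes of a 3-D cube in one pass over the cells.
def multiway_aggregate(cube):
    na, nb, nc = len(cube), len(cube[0]), len(cube[0][0])
    ab = [[0] * nb for _ in range(na)]   # sum over C
    ac = [[0] * nc for _ in range(na)]   # sum over B
    bc = [[0] * nc for _ in range(nb)]   # sum over A
    for a in range(na):
        for b in range(nb):
            for c in range(nc):
                v = cube[a][b][c]
                ab[a][b] += v            # each cell is visited once and
                ac[a][c] += v            # contributes to all three
                bc[b][c] += v            # aggregates simultaneously
    return ab, ac, bc

ab, ac, bc = multiway_aggregate([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])
print(bc)   # [[6, 8], [10, 12]]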
Indexing OLAP Data: Bitmap Index
◼ Suitable for low cardinality domains
◼ Index on a particular column
◼ Each value in the column has a bit vector: bit-op is fast
◼ The length of the bit vector: # of records in the base table
◼ The i-th bit is set if the i-th row of the base table has the value
for the indexed column

Base table                 Index on Region              Index on Type
Cust  Region   Type        RecID  Asia Europe America   RecID  Retail Dealer
C1    Asia     Retail      1      1    0      0         1      1      0
C2    Europe   Dealer      2      0    1      0         2      0      1
C3    Asia     Dealer      3      1    0      0         3      0      1
C4    America  Retail      4      0    0      1         4      1      0
C5    Europe   Dealer      5      0    1      0         5      0      1
43
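A minimal bitmap-index sketch in Python over the base table above (one integer per attribute value serves as the bit vector; this is an illustration, not a production index):

rows = ["Asia", "Europe", "Asia", "America", "Europe"]      # Region column
types = ["Retail", "Dealer", "Dealer", "Retail", "Dealer"]  # Type column

def bitmap(column):
    vectors = {}
    for i, value in enumerate(column):
        vectors[value] = vectors.get(value, 0) | (1 << i)   # set the i-th bit
    return vectors

region, typ = bitmap(rows), bitmap(types)
hits = region["Europe"] & typ["Dealer"]        # fast bit-op instead of a scan
print([i + 1 for i in range(len(rows)) if hits >> i & 1])   # [2, 5]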
Indexing OLAP Data: Join Indices
◼ Join index materializes a relational join and speeds
up the relational join — a rather costly operation
◼ In data warehouses, a join index relates the values
of the dimensions of a star schema to rows in
the fact table.
◼ E.g. fact table: Sales and two dimensions
location and item
◼ A join index on location is a list of pairs
<loc_name, T_id> sorted by location
◼ A join index on location-and-item is a list
of triples <loc_name, item_name, T_id>
sorted by location and item names
◼ Search of a join index can still be slow
◼ Bitmapped join index allows speed-up by using
bit vectors instead of dimension attribute names

44
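A tiny sketch of a join index in Python (the sales tuples T57/T238/T884 and their locations are assumed example data, not from the slide):

# Map each dimension value to the fact-table tuple IDs it joins with, so the
# join need not be recomputed at query time.
sales = {"T57": ("Main Street", "Sony-TV"),    # T_id -> (location, item)
         "T238": ("Main Street", "Sony-TV"),
         "T884": ("Lake Ave", "Sony-TV")}

join_index = {}                                # loc_name -> [T_id, ...]
for t_id, (loc, item) in sorted(sales.items()):
    join_index.setdefault(loc, []).append(t_id)

print(join_index["Main Street"])               # ['T238', 'T57']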
Online Aggregation

◼ Consider an aggregate query:


“finding the average sales by state“
◼ Can we provide the user with some information before
the exact average is computed for all states?
◼ Solution: show the current “running average” for each state as
the computation proceeds.
◼ Even better, if we use statistical techniques and sample tuples
to aggregate instead of simply scanning the aggregated table,
we can provide bounds such as “the average for Wisconsin is
2000±102 with 95% probability.”

45
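A small simulation of this idea in Python (the data distribution and batch size are assumptions): the running mean and a rough 95% bound are reported as more sampled tuples are aggregated.

import random, statistics

def online_average(values, batch=100, z=1.96):
    values = list(values)
    random.shuffle(values)                 # aggregate in random sample order
    seen = []
    for i in range(0, len(values), batch):
        seen += values[i:i + batch]
        mean = statistics.fmean(seen)      # current running average
        half = z * statistics.stdev(seen) / len(seen) ** 0.5
        print(f"after {len(seen):5d} tuples: {mean:8.1f} +/- {half:.1f}")

online_average(random.gauss(2000, 300) for _ in range(10_000))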
Efficient Processing of OLAP Queries

◼ Determine which operations should be performed on the


available cuboids:
◼ transform drill, roll, etc. into corresponding SQL and/or
OLAP operations, e.g, dice = selection + projection
◼ Determine to which materialized cuboid(s) the relevant
operations should be applied.
◼ Exploring indexing structures and compressed vs. dense
array structures in MOLAP (trade-off between indexing and
storage performance)
46
Metadata Repository
◼ Meta data is the data defining warehouse objects. It has the following
kinds
◼ Description of the structure of the warehouse

◼ schema, view, dimensions, hierarchies, derived data definitions, data


mart locations and contents
◼ Operational meta-data
◼ data lineage (history of migrated data and transformation path),
currency of data (active, archived, or purged), monitoring information
(warehouse usage statistics, error reports, audit trails)
◼ The algorithms used for summarization
◼ The mapping from operational environment to the data warehouse
◼ Data related to system performance
◼ warehouse schema, view and derived data definitions
◼ Business data
◼ business terms and definitions, ownership of data, charging policies
47
Data Warehouse Back-End Tools and
Utilities
◼ Data extraction:
◼ get data from multiple, heterogeneous, and external
sources
◼ Data cleaning:
◼ detect errors in the data and rectify them when
possible
◼ Data transformation:
◼ convert data from legacy or host format to warehouse
format
◼ Load:
◼ sort, summarize, consolidate, compute views, check
integrity, and build indices and partitions
◼ Refresh
◼ propagate the updates from the data sources to the
warehouse
48
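A compact sketch of such a back-end pipeline in Python (the CSV file name, column names and cleaning rule are all assumptions made for illustration):

import csv, sqlite3

# Extract from a legacy CSV, clean bad rows, transform to warehouse format,
# and load into a warehouse table; refresh would rerun this incrementally.
def etl(csv_path, con):
    con.execute("CREATE TABLE IF NOT EXISTS sales (item TEXT, amount REAL)")
    with open(csv_path, newline="") as f:
        for rec in csv.DictReader(f):                 # extract
            amount = rec.get("amount", "").strip()
            if not amount:                            # clean: drop bad rows
                continue
            item = rec["item"].strip().upper()        # transform: host format
            con.execute("INSERT INTO sales VALUES (?, ?)",
                        (item, float(amount)))        # load
    con.commit()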
Chapter 2: Data Warehousing and
OLAP Technology for Data Mining

◼ What is a data warehouse?

◼ A multi-dimensional data model

◼ Data warehouse architecture

◼ Data warehouse implementation

◼ Further development of data cube technology

◼ From data warehousing to data mining

49
Discovery-Driven Exploration of Data
Cubes
◼ Hypothesis-driven: exploration by user, huge search space
◼ Discovery-driven (Sarawagi et al.’98)
◼ pre-compute measures indicating exceptions, guide user in the
data analysis, at all levels of aggregation
◼ Exception: significantly different from the value anticipated,
based on a statistical model
◼ Visual cues such as background color are used to reflect the
degree of exception of each cell
◼ Computation of exception indicator can be overlapped with cube
construction

50
Examples: Discovery-Driven Data Cubes

51
Chapter 2: Data Warehousing and
OLAP Technology for Data Mining

◼ What is a data warehouse?

◼ A multi-dimensional data model

◼ Data warehouse architecture

◼ Data warehouse implementation

◼ Further development of data cube technology

◼ From data warehousing to data mining

52
Data Warehouse Usage
◼ Three kinds of data warehouse applications
◼ Information processing
◼ supports querying, basic statistical analysis, and reporting
using crosstabs, tables, charts and graphs
◼ Analytical processing
◼ multidimensional analysis of data warehouse data
◼ supports basic OLAP operations, slice-dice, drilling, pivoting
◼ Data mining
◼ knowledge discovery from hidden patterns
◼ supports associations, constructing analytical models,
performing classification and prediction, and presenting the
mining results using visualization tools.
◼ Differences among the three tasks
53
From On-Line Analytical Processing
to On Line Analytical Mining (OLAM)
◼ Why online analytical mining?
◼ High quality of data in data warehouses
◼ DW contains integrated, consistent, cleaned data

◼ Available information processing structure surrounding data


warehouses
◼ ODBC, OLEDB, Web accessing, service facilities, reporting

and OLAP tools


◼ OLAP-based exploratory data analysis
◼ mining with drilling, dicing, pivoting, etc.

◼ On-line selection of data mining functions


◼ integration and swapping of multiple mining functions,

algorithms, and tasks.


◼ Architecture of OLAM
54
Summary
◼ Data warehouse
◼ A subject-oriented, integrated, time-variant, and nonvolatile collection of
data in support of management’s decision-making process
◼ A multi-dimensional model of a data warehouse
◼ Star schema, snowflake schema, fact constellations
◼ A data cube consists of dimensions & measures
◼ OLAP operations: drilling, rolling, slicing, dicing and pivoting
◼ OLAP servers: ROLAP, MOLAP, HOLAP
◼ Efficient computation of data cubes
◼ Partial vs. full vs. no materialization
◼ Multiway array aggregation
◼ Bitmap index and join index implementations
◼ Further development of data cube technology
◼ Discovery-drive and multi-feature cubes
◼ From OLAP to OLAM (on-line analytical mining)

55
Fall 2004, CIS, Temple University

CIS527: Data Warehousing, Filtering, and


Mining

Lecture 3

• Data Warehousing and OLAP Technology for Data Mining

Lecture slides taken/modified from:


– Vipin Kumar (http://www-users.cs.umn.edu/~kumar/csci5980/index.html)
1
What is Data?
• Collection of data objects and their attributes
• An attribute is a property or characteristic of an object
  – Examples: eye color of a person, temperature, etc.
  – Attribute is also known as variable, field, characteristic, or feature
• A collection of attributes describe an object
  – Object is also known as record, point, case, sample, entity, or instance

(columns = attributes, rows = objects)
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
2
Attribute Values
• Attribute values are numbers or symbols assigned
to an attribute

• Distinction between attributes and attribute values


– Same attribute can be mapped to different attribute
values
• Example: height can be measured in feet or meters

– Different attributes can be mapped to the same set of


values
• Example: Attribute values for ID and age are integers
• But properties of attribute values can be different
– ID has no limit but age has a maximum and minimum value
3
Types of Attributes
• There are different types of attributes
– Nominal
• Examples: ID numbers, eye color, zip codes
– Ordinal
• Examples: rankings (e.g., taste of potato chips on a scale
from 1-10), grades, height in {tall, medium, short}
– Interval
• Examples: calendar dates, temperatures in Celsius or
Fahrenheit.
– Ratio
• Examples: temperature in Kelvin, length, time, counts

4
Properties of Attribute Values
• The type of an attribute depends on which of the
following properties it possesses:
– Distinctness: =  ≠
– Order: <  >
– Addition: +  -
– Multiplication: *  /

– Nominal attribute: distinctness


– Ordinal attribute: distinctness & order
– Interval attribute: distinctness, order & addition
– Ratio attribute: all 4 properties
5
Attribute Type / Description / Examples / Operations

Nominal: The values of a nominal attribute are just different names, i.e.,
nominal attributes provide only enough information to distinguish one
object from another. (=, ≠)
  Examples: zip codes, employee ID numbers, eye color, sex: {male, female}
  Operations: mode, entropy, contingency correlation, χ² test

Ordinal: The values of an ordinal attribute provide enough information to
order objects. (<, >)
  Examples: hardness of minerals, {good, better, best}, grades, street numbers
  Operations: median, percentiles, rank correlation, run tests, sign tests

Interval: For interval attributes, the differences between values are
meaningful, i.e., a unit of measurement exists. (+, -)
  Examples: calendar dates, temperature in Celsius or Fahrenheit
  Operations: mean, standard deviation, Pearson's correlation, t and F tests

Ratio: For ratio variables, both differences and ratios are meaningful. (*, /)
  Examples: temperature in Kelvin, monetary quantities, counts, age, mass,
  length, electrical current
  Operations: geometric mean, harmonic mean, percent variation
6
Attribute Level / Transformation / Comments

Nominal: any permutation of values.
  If all employee ID numbers were reassigned, would it make any difference?

Ordinal: an order-preserving change of values, i.e., new_value = f(old_value)
where f is a monotonic function.
  An attribute encompassing the notion of good, better, best can be
  represented equally well by the values {1, 2, 3} or by {0.5, 1, 10}.

Interval: new_value = a * old_value + b, where a and b are constants.
  Thus, the Fahrenheit and Celsius temperature scales differ in terms of
  where their zero value is and the size of a unit (degree).

Ratio: new_value = a * old_value.
  Length can be measured in meters or feet.
7
Discrete and Continuous Attributes
• Discrete Attribute
– Has only a finite or countably infinite set of values
– Examples: zip codes, counts, or the set of words in a collection
of documents
– Often represented as integer variables.
– Note: binary attributes are a special case of discrete attributes
• Continuous Attribute
– Has real numbers as attribute values
– Examples: temperature, height, or weight.
– Practically, real values can only be measured and represented
using a finite number of digits.
– Continuous attributes are typically represented as floating-point
variables.
8
Types of Data Sets
• Record
– Data Matrix
– Document Data
– Transaction Data
• Multi-Relational
– Star or snowflake schema
• Graph
– World Wide Web
– Molecular Structures
• Ordered
– Spatial Data
– Temporal Data
– Sequential Data
9
Important Characteristics of Structured
Data
– Dimensionality
• Number of attributes each object is described with
• Challenge: high dimensionality (curse of dimensionality)
– Sparsity
• Sparse data: values of most attributes are zero
• Challenge: sparse data call for special handling
– Resolution
• Data properties often could be measured with different
resolutions
• Challenge: decide on the most appropriate resolution (e.g.
“Can’t See the Forest for the Trees”)
10
Record Data
• Data that consists of a collection of records, each of which consists of a fixed set of attributes

Tid Refund Marital Status Taxable Income Cheat
1   Yes    Single         125K           No
2   No     Married        100K           No
3   No     Single         70K            No
4   Yes    Married        120K           No
5   No     Divorced       95K            Yes
6   No     Married        60K            No
7   Yes    Divorced       220K           No
8   No     Single         85K            Yes
9   No     Married        75K            No
10  No     Single         90K            Yes
11
Data Matrix
• If data objects have the same fixed set of numeric attributes, then the data objects can be thought of as points in a multi-dimensional space, where each dimension represents a distinct attribute

• Such a data set can be represented by an m by n matrix, where there are m rows, one for each object, and n columns, one for each attribute

Projection of x Load  Projection of y Load  Distance  Load  Thickness
10.23                 5.27                  15.22     2.7   1.2
12.65                 6.25                  16.22     2.2   1.1
12
Document Data
• Each document becomes a 'term' vector,
  – each term is a component (attribute) of the vector,
  – the value of each component is the number of times the corresponding term occurs in the document.

            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1    3     0      5     0     2      6     0    2      0        2
Document 2    0     7      0     2     1      0     0    3      0        0
Document 3    0     1      0     0     1      2     2    0      3        0
13
Transaction Data
• A special type of record data, where
– each record (transaction) involves a set of items.
– E.g., consider a grocery store. The set of products purchased by a customer during one shopping trip constitutes a transaction, while the individual products that were purchased are the items.

TID Items
1   Bread, Coke, Milk
2   Beer, Bread
3   Beer, Coke, Diaper, Milk
4   Beer, Bread, Diaper, Milk
5   Coke, Diaper, Milk
14
Multi-Relational Data
• Attributes are objects themselves
15
Graph Data
• Examples: Generic graph and HTML Links

(Figure: a generic graph with weighted edges, shown next to this HTML fragment with hyperlinks:)

<a href="papers/papers.html#bbbb">
Data Mining </a>
<li>
<a href="papers/papers.html#aaaa">
Graph Partitioning </a>
<li>
<a href="papers/papers.html#aaaa">
Parallel Solution of Sparse Linear System of Equations </a>
<li>
<a href="papers/papers.html#ffff">
N-Body Computation and Dense Linear System Solvers
16
Chemical Data
• Benzene Molecule: C6H6
17
Ordered Data
• Sequences of transactions

(Figure: sequences of transactions; each element of a sequence is a set of items/events)
18
Ordered Data
• Genomic sequence data

GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG

19
Ordered Data
• Spatial-Temporal Data

(Figure: Average Monthly Temperature of land and ocean)
20
Data Quality
• What kinds of data quality problems?
• How can we detect problems with the data?
• What can we do about these problems?

• Examples of data quality problems:
  – noise and outliers
  – missing values
  – duplicate data
21
Noise
• Noise refers to modification of original values
  – Examples: distortion of a person's voice when talking on a poor phone and "snow" on a television screen

(Figure: Two Sine Waves; Two Sine Waves + Noise)
22
Outliers
• Outliers are data objects with characteristics that are considerably different from those of most of the other data objects in the data set
23
Missing Values
• Reasons for missing values
– Information is not collected
(e.g., people decline to give their age and weight)
– Attributes may not be applicable to all cases
(e.g., annual income is not applicable to children)
• Handling missing values
– Eliminate Data Objects
– Estimate Missing Values
– Ignore the Missing Value During Analysis
24
Duplicate Data
• Data set may include data objects that are duplicates, or almost duplicates, of one another
  – Major issue when merging data from heterogeneous sources

• Examples:
  – Same person with multiple email addresses

• Data cleaning
  – Process of dealing with duplicate data issues
25
Data Preprocessing
• Aggregation
• Sampling
• Dimensionality Reduction
• Feature subset selection
• Feature creation
• Discretization and Binarization
• Attribute Transformation
26
Aggregation
• Combining two or more attributes (or objects) into a single attribute (or object)

• Purpose
  – Data reduction
    • Reduce the number of attributes or objects
  – Change of scale
    • Cities aggregated into regions, states, countries, etc.
  – More "stable" data
    • Aggregated data tends to have less variability
27
Sampling
• Sampling is the main technique employed for data selection.
  – It is often used for both the preliminary investigation of the data and the final data analysis.

• Statisticians sample because obtaining the entire set of data of interest is too expensive or time consuming.

• Sampling is used in data mining because processing the entire set of data of interest is too expensive or time consuming.
28
Sampling …
• The key principle for effective sampling is the following:
  – using a sample will work almost as well as using the entire data set, if the sample is representative

  – A sample is representative if it has approximately the same property (of interest) as the original set of data
29
Types of Sampling
• Simple Random Sampling
  – There is an equal probability of selecting any particular item

• Sampling without replacement
  – As each item is selected, it is removed from the population

• Sampling with replacement
  – Objects are not removed from the population as they are selected for the sample.
    • In sampling with replacement, the same object can be picked more than once

• Stratified sampling
  – Split the data into several partitions; then draw random samples from each partition (see the sketch below)
30
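A minimal NumPy sketch of the three sampling schemes above (the toy population, stratum labels, and sample sizes are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(seed=0)
data = np.arange(100)          # toy population of 100 items
strata = data % 4              # hypothetical stratum label for each item

# Simple random sampling without replacement: each item appears at most once.
without = rng.choice(data, size=10, replace=False)

# Sampling with replacement: the same item can be picked more than once.
with_repl = rng.choice(data, size=10, replace=True)

# Stratified sampling: draw a fixed-size random sample from each partition.
stratified = np.concatenate([rng.choice(data[strata == s], size=3, replace=False)
                             for s in np.unique(strata)])
print(without, with_repl, stratified, sep="\n")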
Sample Size

(Figure: the same data set sampled at 8000 points, 2000 points, and 500 points)

31
Sample Size
• What sample size is necessary to get at least one object from each of 10 groups?
32
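One way to answer this question is by simulation (a sketch; the equal-sized groups and trial count are assumptions made for illustration):

import numpy as np

rng = np.random.default_rng(seed=0)

def prob_all_groups(sample_size, n_groups=10, n_trials=10_000):
    # Estimate P(a random sample touches every one of the n_groups groups),
    # assuming objects are spread evenly across the groups.
    hits = 0
    for _ in range(n_trials):
        groups = rng.integers(0, n_groups, size=sample_size)
        hits += len(np.unique(groups)) == n_groups
    return hits / n_trials

for size in (10, 20, 30, 40, 50, 60):
    print(size, prob_all_groups(size))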
Curse of Dimensionality
• When dimensionality increases, data becomes increasingly sparse in the space that it occupies

• Definitions of density and distance between points, which are critical for clustering and outlier detection, become less meaningful

(Figure: randomly generate 500 points and compute the difference between the max and min distance between any pair of points; the relative difference shrinks as dimensionality grows)
33
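A short experiment in the spirit of the figure (500 uniform random points; the dimensionalities tried are arbitrary, and SciPy is assumed to be available):

import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(seed=0)
for d in (2, 10, 100, 1000):
    pts = rng.random((500, d))    # 500 random points in d dimensions
    dists = pdist(pts)            # all pairwise Euclidean distances
    # The relative spread of distances shrinks as d grows:
    print(d, (dists.max() - dists.min()) / dists.min())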
Dimensionality Reduction
• Purpose:
  – Avoid curse of dimensionality
  – Reduce amount of time and memory required by data mining algorithms
  – Allow data to be more easily visualized
  – May help to eliminate irrelevant features or reduce noise

• Techniques
  – Principal Component Analysis
  – Singular Value Decomposition
  – Others: supervised and non-linear techniques
34
Dimensionality Reduction: PCA
• Goal is to find a projection that captures the largest amount of variation in data

(Figure: 2-D data with axes x1, x2 and the direction of greatest variation drawn through the point cloud)
35
Dimensionality Reduction: PCA
• Find the eigenvectors of the covariance matrix
• The eigenvectors define the new space

(Figure: the same data with the eigenvector directions overlaid)
36
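A minimal sketch following exactly these two steps (the toy 2-D data is an assumption):

import numpy as np

rng = np.random.default_rng(seed=0)
X = rng.multivariate_normal([0, 0], [[3, 2], [2, 2]], size=200)

# Center the data, then find the eigenvectors of the covariance matrix.
Xc = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))

# The eigenvectors define the new space; project onto the top component.
top = eigvecs[:, np.argmax(eigvals)]
projected = Xc @ top          # 1-D representation capturing most variation
print(eigvals, projected[:5])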
Dimensionality Reduction: ISOMAP
By: Tenenbaum, de Silva, Langford (2000)

• Construct a neighbourhood graph
• For each pair of points in the graph, compute the shortest path distances – geodesic distances
37
Feature Subset Selection
• Another way to reduce dimensionality of data

• Redundant features
  – duplicate much or all of the information contained in one or more other attributes
  – Example: purchase price of a product and the amount of sales tax paid

• Irrelevant features
  – contain no information that is useful for the data mining task at hand
  – Example: students' ID is often irrelevant to the task of predicting students' GPA
38
Feature Subset Selection
• Techniques:
– Brute-force approach:
• Try all possible feature subsets as input to data mining algorithm
– Embedded approaches:
• Feature selection occurs naturally as part of the data mining
algorithm
– Filter approaches:
• Features are selected before data mining algorithm is run
– Wrapper approaches:
• Use the data mining algorithm as a black box to find best
subset of attributes
39
Feature Creation
• Create new attributes that can capture the important information in a data set much more efficiently than the original attributes

• Three general methodologies:
  – Feature Extraction
    • domain-specific
  – Mapping Data to New Space
  – Feature Construction
    • combining features
40
Example: Mapping Data to a New Space
• Fourier transform
• Wavelet transform

(Figure: Two Sine Waves; Two Sine Waves + Noise; the frequency spectrum)
41
Discretization and Binarization
• Different data mining applications require
specific data formats
– Categorical only (discretization)
– Binary only (binarization)
– Interval/Ratio only (binarization)
• Discretization: transforming interval attribute into
categorical
• Binarization: transforming non-binary attribute
into a set of binary attributes
42
Discretization Using Class Labels
• Entropy based approach

(Figure: discretization into 3 categories for both x and y vs. 5 categories for both x and y)
43
Attribute Transformation
• A function that maps the entire set of values of a given attribute to a new set of replacement values such that each old value can be identified with one of the new values
  – Simple functions: x^k, log(x), e^x, |x|
  – Standardization and Normalization
44
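A small sketch of the two transformations named above (the attribute values are hypothetical):

import numpy as np

rng = np.random.default_rng(seed=0)
income = rng.normal(70_000, 15_000, size=1000)   # hypothetical attribute

# Standardization (z-score): zero mean, unit standard deviation.
standardized = (income - income.mean()) / income.std()

# Min-max normalization: rescale values into [0, 1].
normalized = (income - income.min()) / (income.max() - income.min())
print(round(standardized.mean(), 3), normalized.min(), normalized.max())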
Fall 2004, CIS, Temple University

CIS527: Data Warehousing, Filtering, and Mining

Lecture 4

• Tutorial: Connecting SQL Server to Matlab using the Matlab Database Toolbox
• Association Rule Mining

Lecture slides taken/modified from:
– Jiawei Han (http://www-sal.cs.uiuc.edu/~hanj/DM_Book.html)
– Vipin Kumar (http://www-users.cs.umn.edu/~kumar/csci5980/index.html)
Motivation: Association Rule Mining
• Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction

Market-Basket transactions:

TID Items
1   Bread, Milk
2   Bread, Diaper, Beer, Eggs
3   Milk, Diaper, Beer, Coke
4   Bread, Milk, Diaper, Beer
5   Bread, Milk, Diaper, Coke

Example of Association Rules:
{Diaper} → {Beer},
{Milk, Bread} → {Eggs, Coke},
{Beer, Bread} → {Milk}

Implication means co-occurrence, not causality!
Applications: Association Rule Mining

• * → Maintenance Agreement
  – What should the store do to boost Maintenance Agreement sales?
• Home Electronics → *
  – What other products should the store stock up on?
• Attached mailing in direct marketing
• Detecting "ping-ponging" of patients
• Marketing and Sales Promotion
• Supermarket shelf management
Definition: Frequent Itemset
• Itemset
  – A collection of one or more items
    • Example: {Milk, Bread, Diaper}
  – k-itemset
    • An itemset that contains k items
• Support count (σ)
  – Frequency of occurrence of an itemset
  – E.g. σ({Milk, Bread, Diaper}) = 2
• Support
  – Fraction of transactions that contain an itemset
  – E.g. s({Milk, Bread, Diaper}) = 2/5
• Frequent Itemset
  – An itemset whose support is greater than or equal to a minsup threshold

TID Items
1   Bread, Milk
2   Bread, Diaper, Beer, Eggs
3   Milk, Diaper, Beer, Coke
4   Bread, Milk, Diaper, Beer
5   Bread, Milk, Diaper, Coke
Definition: Association Rule
• Association Rule
  – An implication expression of the form X → Y, where X and Y are itemsets
  – Example: {Milk, Diaper} → {Beer}

• Rule Evaluation Metrics
  – Support (s)
    • Fraction of transactions that contain both X and Y
  – Confidence (c)
    • Measures how often items in Y appear in transactions that contain X

Example: {Milk, Diaper} → {Beer}, over the same five market-basket transactions:

s = σ(Milk, Diaper, Beer) / |T| = 2/5 = 0.4
c = σ(Milk, Diaper, Beer) / σ(Milk, Diaper) = 2/3 = 0.67
Association Rule Mining Task

• Given a set of transactions T, the goal of association rule mining is to find all rules having
  – support ≥ minsup threshold
  – confidence ≥ minconf threshold

• Brute-force approach:
  – List all possible association rules
  – Compute the support and confidence for each rule
  – Prune rules that fail the minsup and minconf thresholds
  ⇒ Computationally prohibitive!
Computational Complexity
• Given d unique items:
  – Total number of itemsets = 2^d
  – Total number of possible association rules:

    R = Σ_{k=1}^{d−1} [ C(d, k) × Σ_{j=1}^{d−k} C(d−k, j) ]
      = 3^d − 2^(d+1) + 1

  If d = 6, R = 602 rules
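A quick check of this count using only the Python standard library:

from math import comb

d = 6
R = sum(comb(d, k) * sum(comb(d - k, j) for j in range(1, d - k + 1))
        for k in range(1, d))
print(R, 3**d - 2**(d + 1) + 1)   # both print 602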
Mining Association Rules: Decoupling

TID Items
1   Bread, Milk
2   Bread, Diaper, Beer, Eggs
3   Milk, Diaper, Beer, Coke
4   Bread, Milk, Diaper, Beer
5   Bread, Milk, Diaper, Coke

Example of Rules:
{Milk,Diaper} → {Beer} (s=0.4, c=0.67)
{Milk,Beer} → {Diaper} (s=0.4, c=1.0)
{Diaper,Beer} → {Milk} (s=0.4, c=0.67)
{Beer} → {Milk,Diaper} (s=0.4, c=0.67)
{Diaper} → {Milk,Beer} (s=0.4, c=0.5)
{Milk} → {Diaper,Beer} (s=0.4, c=0.5)

Observations:
• All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}
• Rules originating from the same itemset have identical support but can have different confidence
• Thus, we may decouple the support and confidence requirements
Mining Association Rules

• Two-step approach:
  1. Frequent Itemset Generation
     – Generate all itemsets whose support ≥ minsup

  2. Rule Generation
     – Generate high confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset

• Frequent itemset generation is still computationally expensive
Frequent Itemset Generation
• Brute-force approach:
  – Each itemset in the lattice is a candidate frequent itemset
  – Count the support of each candidate by scanning the database
    (Figure: N transactions, each of width w, matched against a list of M candidates)
  – Match each transaction against every candidate
  – Complexity ~ O(NMw) => Expensive since M = 2^d !!!
Frequent Itemset Generation Strategies
• Reduce the number of candidates (M)
  – Complete search: M = 2^d
  – Use pruning techniques to reduce M

• Reduce the number of transactions (N)
  – Reduce size of N as the size of itemset increases
  – Use a subsample of N transactions

• Reduce the number of comparisons (NM)
  – Use efficient data structures to store the candidates or transactions
  – No need to match every candidate against every transaction
Reducing Number of Candidates: Apriori
• Apriori principle:
  – If an itemset is frequent, then all of its subsets must also be frequent

• Apriori principle holds due to the following property of the support measure:

  ∀X, Y: (X ⊆ Y) ⟹ s(X) ≥ s(Y)

  – Support of an itemset never exceeds the support of its subsets
  – This is known as the anti-monotone property of support
Illustrating Apriori Principle

(Itemset lattice: if an itemset such as AB is found to be infrequent, all of its supersets are pruned)

null

A B C D E

AB AC AD AE BC BD BE CD CE DE

ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE

ABCD ABCE ABDE ACDE BCDE

ABCDE
Illustrating Apriori Principle

Items (1-itemsets):
Item   Count
Bread  4
Coke   2
Milk   4
Beer   3
Diaper 4
Eggs   1

Pairs (2-itemsets), with Minimum Support = 3 (no need to generate candidates involving Coke or Eggs):
Itemset         Count
{Bread,Milk}    3
{Bread,Beer}    2
{Bread,Diaper}  3
{Milk,Beer}     2
{Milk,Diaper}   3
{Beer,Diaper}   3

Triplets (3-itemsets):
Itemset              Count
{Bread,Milk,Diaper}  3

If every subset is considered: C(6,1) + C(6,2) + C(6,3) = 41
With support-based pruning: 6 + 6 + 1 = 13
Apriori Algorithm

• Method (a runnable sketch follows):
  – Let k=1
  – Generate frequent itemsets of length 1
  – Repeat until no new frequent itemsets are identified
    • Generate length (k+1) candidate itemsets from length k frequent itemsets
    • Prune candidate itemsets containing subsets of length k that are infrequent
    • Count the support of each candidate by scanning the DB
    • Eliminate candidates that are infrequent, leaving only those that are frequent
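A compact plain-Python sketch of this loop (absolute support counts; illustrative, not an optimized implementation):

from itertools import combinations

def apriori(transactions, minsup):
    # Frequent itemsets of length 1.
    items = {i for t in transactions for i in t}
    freq = {frozenset([i]) for i in items
            if sum(i in t for t in transactions) >= minsup}
    all_freq, k = set(freq), 1
    while freq:
        # Generate length-(k+1) candidates from length-k frequent itemsets.
        cands = {a | b for a in freq for b in freq if len(a | b) == k + 1}
        # Prune candidates containing an infrequent length-k subset.
        cands = {c for c in cands
                 if all(frozenset(s) in freq for s in combinations(c, k))}
        # Count support by scanning the DB; eliminate infrequent candidates.
        freq = {c for c in cands
                if sum(c <= t for t in transactions) >= minsup}
        all_freq |= freq
        k += 1
    return all_freq

# The market-basket transactions used throughout this lecture:
db = [{'Bread', 'Milk'},
      {'Bread', 'Diaper', 'Beer', 'Eggs'},
      {'Milk', 'Diaper', 'Beer', 'Coke'},
      {'Bread', 'Milk', 'Diaper', 'Beer'},
      {'Bread', 'Milk', 'Diaper', 'Coke'}]
print(sorted(map(sorted, apriori(db, minsup=3))))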
Apriori: Reducing Number of Comparisons
• Candidate counting:
– Scan the database of transactions to determine the support of
each candidate itemset
– To reduce the number of comparisons, store the candidates in a
hash structure
• Instead of matching each transaction against every candidate,
match it against candidates contained in the hashed buckets
(Figure: N transactions hashed into a hash structure whose buckets hold candidate k-itemsets)
Apriori: Implementation Using Hash Tree
Suppose you have 15 candidate itemsets of length 3:
{1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6}, {2 3 4}, {5 6 7}, {3 4 5}, {3
5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8}
You need:
• Hash function
• Max leaf size: max number of itemsets stored in a leaf node
(if number of candidate itemsets exceeds max leaf size, split the node)
(Figure: a hash tree over the 15 candidates; the hash function sends items 1,4,7 to the left branch, 2,5,8 to the middle branch, and 3,6,9 to the right branch, splitting leaves that exceed the max leaf size)
Apriori: Implementation Using Hash Tree
(Figure: transaction {1 2 3 5 6} is hashed down the tree by expanding prefixes, 1+{2 3 5 6}, 2+{3 5 6}, 3+{5 6}, then 12+{3 5 6}, 13+{5 6}, 15+{6}, visiting only the buckets the transaction can reach)
Match transaction against 11 out of 15 candidates
Apriori: Alternative Search Methods

• Traversal of Itemset Lattice
  – General-to-specific vs Specific-to-general

(Figure: the frequent itemset border in the lattice, from null down to {a1, a2, ..., an}, under (a) general-to-specific, (b) specific-to-general, and (c) bidirectional traversal)
Apriori: Alternative Search Methods

• Traversal of Itemset Lattice
  – Breadth-first vs Depth-first

(Figure: (a) breadth-first vs (b) depth-first traversal of the lattice)
Bottlenecks of Apriori

• Candidate generation can result in huge candidate sets:
  – 10^4 frequent 1-itemsets will generate 10^7 candidate 2-itemsets
  – To discover a frequent pattern of size 100, e.g., {a1, a2, …, a100}, one needs to generate 2^100 ≈ 10^30 candidates.
• Multiple scans of database:
  – Needs (n+1) scans, where n is the length of the longest pattern
ECLAT: Another Method for Frequent Itemset
Generation
• ECLAT: for each item, store a list of transaction
ids (tids); vertical data layout

Horizontal
Data Layout Vertical Data Layout
TID Items A B C D E
1 A,B,E 1 1 2 2 1
2 B,C,D 4 2 3 4 3
3 C,E 5 5 4 5 6
4 A,C,D 6 7 8 9
5 A,B,C,D 7 8 9
6 A,E 8 10
7 A,B 9
8 A,B,C
9 A,C,D
10 B
TID-list
ECLAT: Another Method for Frequent Itemset Generation
• Determine support of any k-itemset by intersecting tid-lists of two of its (k-1) subsets.

  A {1,4,5,6,7,8,9} ∩ B {1,2,5,7,8,10} → AB {1,5,7,8}

• 3 traversal approaches:
  – top-down, bottom-up and hybrid
• Advantage: very fast support counting
• Disadvantage: intermediate tid-lists may become too large for memory
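The same tid-list intersection, sketched with Python sets (tids taken from the vertical layout above):

# Vertical data layout: item -> set of transaction ids (tid-list).
tidlists = {
    'A': {1, 4, 5, 6, 7, 8, 9},
    'B': {1, 2, 5, 7, 8, 10},
}

# Support of {A, B} = size of the intersection of the two tid-lists.
ab = tidlists['A'] & tidlists['B']
print(sorted(ab), len(ab))   # [1, 5, 7, 8] -> support 4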
FP-growth: Another Method for Frequent Itemset Generation

• Use a compressed representation of the database using an FP-tree

• Once an FP-tree has been constructed, it uses a recursive divide-and-conquer approach to mine the frequent itemsets
FP-Tree Construction

TID Items
1   {A,B}
2   {B,C,D}
3   {A,C,D,E}
4   {A,D,E}
5   {A,B,C}
6   {A,B,C,D}
7   {B,C}
8   {A,B,C}
9   {A,B,D}
10  {B,C,E}

After reading TID=1: null → A:1 → B:1
After reading TID=2: the tree gains a second branch, null → B:1 → C:1 → D:1
FP-Tree Construction

(Figure: the complete FP-tree for the transaction database above. The main branch starts at A:7, with children B:5, C:1, and D:1; a second branch starts at B:3 with child C:3. A header table with one entry per item (A, B, C, D, E) holds pointers that link all tree nodes carrying the same item.)

Pointers are used to assist frequent itemset generation
FP-growth

Build conditional pattern base for E:
P = {(A:1,C:1,D:1),
     (A:1,D:1),
     (B:1,C:1)}
Recursively apply FP-growth on P

(Figure: the full FP-tree with the paths ending in E highlighted)
FP-growth

Conditional Pattern base for E:
P = {(A:1,C:1,D:1,E:1),
     (A:1,D:1,E:1),
     (B:1,C:1,E:1)}
Count for E is 3: {E} is a frequent itemset
Recursively apply FP-growth on P

(Figure: conditional tree for E: null → A:2 with paths C:1 → D:1 → E:1 and D:1 → E:1, plus null → B:1 → C:1 → E:1)
FP-growth

Conditional pattern base for D within conditional base for E:
P = {(A:1,C:1,D:1),
     (A:1,D:1)}
Count for D is 2: {D,E} is a frequent itemset
Recursively apply FP-growth on P

(Figure: conditional tree for D within the conditional tree for E: null → A:2 with paths C:1 → D:1 and D:1)
FP-growth

Conditional pattern base for C within D within E:
P = {(A:1,C:1)}
Count for C is 1: {C,D,E} is NOT a frequent itemset

(Figure: conditional tree for C within D within E: null → A:1 → C:1)
FP-growth

Conditional tree for A within D within E:
Count for A is 2: {A,D,E} is a frequent itemset
Next step: construct the conditional tree for C within the conditional tree for E
Continue until exploring the conditional tree for A (which has only node A)

(Figure: null → A:2)
Benefits of the FP-tree Structure
• Performance study shows
  – FP-growth is an order of magnitude faster than Apriori, and is also faster than tree-projection

(Figure: run time (sec.) vs. support threshold (%) on data set D1; FP-growth's run time stays low while Apriori's grows sharply as the threshold drops)

• Reasoning
  – No candidate generation, no candidate test
  – Use compact data structure
  – Eliminate repeated database scan
  – Basic operation is counting and FP-tree building
Complexity of Association Mining
• Choice of minimum support threshold
– lowering support threshold results in more frequent itemsets
– this may increase number of candidates and max length of
frequent itemsets
• Dimensionality (number of items) of the data set
– more space is needed to store support count of each item
– if number of frequent items also increases, both computation and
I/O costs may also increase
• Size of database
– since Apriori makes multiple passes, run time of algorithm may
increase with number of transactions
• Average transaction width
– transaction width increases with denser data sets
– This may increase max length of frequent itemsets and traversals
of hash tree (number of subsets in a transaction increases with its
width)
Compact Representation of Frequent
Itemsets
• Some itemsets are redundant because they have the same support as their supersets
TID A1 A2 A3 A4 A5 A6 A7 A8 A9 A10 B1 B2 B3 B4 B5 B6 B7 B8 B9 B10 C1 C2 C3 C4 C5 C6 C7 C8 C9 C10
1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
5 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
6 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
7 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
8 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
9 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
10 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
11 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1
12 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1
13 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1
14 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1
15 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1
• Number of frequent itemsets = 3 × Σ_{k=1}^{10} C(10, k) = 3 × (2^10 − 1) = 3069

• Need a compact representation
Maximal Frequent Itemset
An itemset is maximal frequent if none of its immediate supersets
is frequent
null

Maximal A B C D E
Itemsets

AB AC AD AE BC BD BE CD CE DE

ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE

ABCD ABCE ABDE ACDE BCDE
ABCDE

(Figure labels: the border separates the frequent itemsets from the infrequent itemsets; the maximal frequent itemsets lie just below the border)
Closed Itemset

• Problem with maximal frequent itemsets:
  – Support of their subsets is not known – additional DB scans are needed
• An itemset is closed if none of its immediate supersets has the same support as the itemset

TID Items
1   {A,B}
2   {B,C,D}
3   {A,B,C,D}
4   {A,B,D}
5   {A,B,C,D}

Itemset Support      Itemset   Support
{A}     4            {A,B,C}   2
{B}     5            {A,B,D}   3
{C}     3            {A,C,D}   2
{D}     4            {B,C,D}   2
{A,B}   4            {A,B,C,D} 2
{A,C}   2
{A,D}   3
{B,C}   3
{B,D}   4
{C,D}   3
Maximal vs Closed Frequent Itemsets
Minimum support = 2

TID Items
1   ABC
2   ABCD
3   BCE
4   ACDE
5   DE

(Figure: the itemset lattice annotated with the ids of the transactions containing each itemset; closed itemsets are marked "closed but not maximal" or "closed and maximal")

# Closed = 9
# Maximal = 4
Maximal vs Closed Itemsets

Frequent
Itemsets

Closed
Frequent
Itemsets

Maximal
Frequent
Itemsets
Rule Generation

• Given a frequent itemset L, find all non-empty subsets f ⊂ L such that f → L – f satisfies the minimum confidence requirement
  – If {A,B,C,D} is a frequent itemset, candidate rules:
    ABC → D, ABD → C, ACD → B, BCD → A,
    A → BCD, B → ACD, C → ABD, D → ABC,
    AB → CD, AC → BD, AD → BC, BC → AD,
    BD → AC, CD → AB

• If |L| = k, then there are 2^k – 2 candidate association rules (ignoring L → ∅ and ∅ → L)
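A sketch that enumerates exactly these candidate rules (standard library only):

from itertools import combinations

L = {'A', 'B', 'C', 'D'}                  # a frequent itemset
rules = [(set(f), L - set(f))             # the rule f -> L - f
         for r in range(1, len(L))        # non-empty proper subsets f
         for f in combinations(sorted(L), r)]
print(len(rules))                         # 2**4 - 2 = 14 candidate rules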
Rule Generation

• How to efficiently generate rules from frequent itemsets?
  – In general, confidence does not have an anti-monotone property
    c(ABC → D) can be larger or smaller than c(AB → D)

  – But confidence of rules generated from the same itemset has an anti-monotone property
  – e.g., L = {A,B,C,D}:

    c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD)

• Confidence is anti-monotone w.r.t. the number of items on the RHS of the rule
Rule Generation
Lattice of rules
ABCD=>{ }
Low
Confidence
Rule
BCD=>A ACD=>B ABD=>C ABC=>D

CD=>AB BD=>AC BC=>AD AD=>BC AC=>BD AB=>CD

D=>ABC C=>ABD B=>ACD A=>BCD


Pruned
Rules
Presentation of Association Rules (Table Form)
Visualization of Association Rule Using Plane Graph
Visualization of Association Rule Using Rule Graph
Fall 2004, CIS, Temple University

CIS527: Data Warehousing, Filtering, and Mining

Lecture 6

• Clustering

Lecture slides taken/modified from:
– Jiawei Han (http://www-sal.cs.uiuc.edu/~hanj/DM_Book.html)
– Vipin Kumar (http://www-users.cs.umn.edu/~kumar/csci5980/index.html)

1
General Applications of Clustering

• Pattern Recognition
• Spatial Data Analysis
– create thematic maps in GIS by clustering feature
spaces
– detect spatial clusters and explain them in spatial data
mining
• Image Processing
• Economic Science (especially market research)
• WWW
– Document classification
– Cluster Weblog data to discover groups of similar
access patterns
3
Examples of Clustering Applications
• Marketing: Help marketers discover distinct groups in their
customer bases, and then use this knowledge to develop
targeted marketing programs
• Land use: Identification of areas of similar land use in an
earth observation database
• Insurance: Identifying groups of motor insurance policy
holders with a high average claim cost
• City-planning: Identifying groups of houses according to
their house type, value, and geographical location
• Earth-quake studies: Observed earth quake epicenters
should be clustered along continent faults
4
What Is Good Clustering?

• A good clustering method will produce high quality clusters with
  – high intra-class similarity
  – low inter-class similarity
• The quality of a clustering result depends on both the similarity measure used by the method and its implementation.
• The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.
5
Requirements of Clustering in Data
Mining
• Scalability
• Ability to deal with different types of attributes
• Discovery of clusters with arbitrary shape
• Minimal requirements for domain knowledge to
determine input parameters
• Able to deal with noise and outliers
• Insensitive to order of input records
• High dimensionality
• Incorporation of user-specified constraints
• Interpretability and usability
6
Data Structures in Clustering

• Data matrix (two modes): n objects × p variables

  [ x_11 ... x_1f ... x_1p ]
  [ ...  ... ...  ... ...  ]
  [ x_i1 ... x_if ... x_ip ]
  [ ...  ... ...  ... ...  ]
  [ x_n1 ... x_nf ... x_np ]

• Dissimilarity matrix (one mode): n × n, lower-triangular

  [ 0                            ]
  [ d(2,1)  0                    ]
  [ d(3,1)  d(3,2)  0            ]
  [ :       :       :            ]
  [ d(n,1)  d(n,2)  ...  ...  0  ]
7
Measuring Similarity
• Dissimilarity/Similarity metric: Similarity is expressed in
terms of a distance function, which is typically metric:
d(i, j)
• There is a separate “quality” function that measures the
“goodness” of a cluster.
• The definitions of distance functions are usually very
different for interval-scaled, boolean, categorical, ordinal
and ratio variables.
• Weights should be associated with different variables
based on applications and data semantics.
• It is hard to define “similar enough” or “good enough”
– the answer is typically highly subjective.
8
Interval-valued variables

• Standardize data
  – Calculate the mean squared deviation:

    s_f = (1/n) (|x_1f − m_f|² + |x_2f − m_f|² + … + |x_nf − m_f|²)

    where m_f = (1/n) (x_1f + x_2f + … + x_nf)

  – Calculate the standardized measurement (z-score):

    z_if = (x_if − m_f) / s_f

• Using mean absolute deviation could be more robust than using standard deviation
9
Similarity and Dissimilarity Between Objects
• Distances are normally used to measure the similarity or dissimilarity between two data objects
• Some popular ones include the Minkowski distance:

  d(i, j) = (|x_i1 − x_j1|^q + |x_i2 − x_j2|^q + … + |x_ip − x_jp|^q)^(1/q)

  where i = (x_i1, x_i2, …, x_ip) and j = (x_j1, x_j2, …, x_jp) are two p-dimensional data objects, and q is a positive integer
• If q = 1, d is the Manhattan distance:

  d(i, j) = |x_i1 − x_j1| + |x_i2 − x_j2| + … + |x_ip − x_jp|
10
Similarity and Dissimilarity Between Objects
• If q = 2, d is the Euclidean distance:

  d(i, j) = (|x_i1 − x_j1|² + |x_i2 − x_j2|² + … + |x_ip − x_jp|²)^(1/2)

  – Properties
    • d(i,j) ≥ 0
    • d(i,i) = 0
    • d(i,j) = d(j,i)
    • d(i,j) ≤ d(i,k) + d(k,j)
• One can also use weighted distance, parametric Pearson product moment correlation, or other dissimilarity measures.
11
Mahalanobis Distance

  mahalanobis(p, q) = (p − q) Σ⁻¹ (p − q)ᵀ

  Σ is the covariance matrix of the input data X:

  Σ_{j,k} = (1/(n−1)) Σ_{i=1}^{n} (X_ij − X̄_j)(X_ik − X̄_k)

For the red points in the figure, the Euclidean distance is 14.7, the Mahalanobis distance is 6.
12
Mahalanobis Distance

Covariance Matrix:
  Σ = [0.3 0.2]
      [0.2 0.3]

A: (0.5, 0.5)
B: (0, 1)
C: (1.5, 1.5)

Mahal(A,B) = 5
Mahal(A,C) = 4
13
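A NumPy check of these two values (a sketch; note the row-vector convention of the formula above):

import numpy as np

sigma = np.array([[0.3, 0.2],
                  [0.2, 0.3]])
inv = np.linalg.inv(sigma)

def mahal(p, q):
    # The form (p - q) Sigma^-1 (p - q)^T from the previous slide.
    d = np.asarray(p) - np.asarray(q)
    return d @ inv @ d

A, B, C = (0.5, 0.5), (0.0, 1.0), (1.5, 1.5)
print(mahal(A, B), mahal(A, C))   # approximately 5.0 and 4.0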
Cosine Similarity

• If d1 and d2 are two document vectors, then
  cos(d1, d2) = (d1 • d2) / (||d1|| ||d2||),
  where • indicates the vector dot product and ||d|| is the length of vector d.

• Example:
  d1 = 3 2 0 5 0 0 0 2 0 0
  d2 = 1 0 0 0 0 0 0 1 0 2

  d1 • d2 = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
  ||d1|| = (3*3 + 2*2 + 0*0 + 5*5 + 0*0 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = (42)^0.5 = 6.481
  ||d2|| = (1*1 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 1*1 + 0*0 + 2*2)^0.5 = (6)^0.5 = 2.449

  cos(d1, d2) = 5 / (6.481 × 2.449) = 0.3150
14
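The same computation with NumPy (a sketch):

import numpy as np

d1 = np.array([3, 2, 0, 5, 0, 0, 0, 2, 0, 0])
d2 = np.array([1, 0, 0, 0, 0, 0, 0, 1, 0, 2])

# Dot product divided by the product of the vector lengths.
cos = d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2))
print(round(cos, 4))   # 0.315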
Correlation Measure

Scatter plots
showing the
similarity from
–1 to 1.

15
Binary Variables
• A contingency table for binary data

              Object j
              1     0     sum
Object i  1   a     b     a+b
          0   c     d     c+d
        sum   a+c   b+d   p

• Simple matching coefficient (invariant, if the binary variable is symmetric):

  d(i, j) = (b + c) / (a + b + c + d)

• Jaccard coefficient (noninvariant if the binary variable is asymmetric):

  d(i, j) = (b + c) / (a + b + c)
16
Dissimilarity between Binary Variables
• Example

Name Gender Fever Cough Test-1 Test-2 Test-3 Test-4


Jack M Y N P N N N
Mary F Y N P N P N
Jim M Y P N N N N
– gender is a symmetric attribute
– the remaining attributes are asymmetric binary
– let the values Y and P be set to 1, and the value N be set to 0
d(jack, mary) = (0 + 1) / (2 + 0 + 1) = 0.33
d(jack, jim)  = (1 + 1) / (1 + 1 + 1) = 0.67
d(jim, mary)  = (1 + 2) / (1 + 1 + 2) = 0.75
17
Nominal Variables

• A generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green
• Method 1: Simple matching
  – m: # of matches, p: total # of variables

  d(i, j) = (p − m) / p

• Method 2: use a large number of binary variables
  – creating a new binary variable for each of the M nominal states
18
Ordinal Variables
• An ordinal variable can be discrete or continuous
• order is important, e.g., rank
• Can be treated like interval-scaled
  – replace x_if by its rank r_if ∈ {1, …, M_f}
  – map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by

    z_if = (r_if − 1) / (M_f − 1)

  – compute the dissimilarity using methods for interval-scaled variables
19
Ratio-Scaled Variables

• Ratio-scaled variable: a positive measurement on a nonlinear scale, approximately at exponential scale, such as A·e^(Bt) or A·e^(−Bt)
• Methods:
– treat them like interval-scaled variables — not a good choice!
(why?)
– apply logarithmic transformation
yif = log(xif)
– treat them as continuous ordinal data treat their rank as interval-
scaled.

20
Variables of Mixed Types
• A database may contain all six types of variables
  – symmetric binary, asymmetric binary, nominal, ordinal, interval and ratio.
• One may use a weighted formula to combine their effects:

  d(i, j) = Σ_{f=1}^{p} δ_ij^(f) d_ij^(f) / Σ_{f=1}^{p} δ_ij^(f)

  – f is binary or nominal:
    d_ij^(f) = 0 if x_if = x_jf, otherwise d_ij^(f) = 1
  – f is interval-based: use the normalized distance
  – f is ordinal or ratio-scaled
    • compute ranks r_if and z_if = (r_if − 1) / (M_f − 1)
    • and treat z_if as interval-scaled
21
Notion of a Cluster can be Ambiguous

How many clusters? Six Clusters

Two Clusters Four Clusters

22
Other Distinctions Between Sets of
Clusters
• Exclusive versus non-exclusive
– In non-exclusive clusterings, points may belong to multiple
clusters.
– Can represent multiple classes or ‘border’ points
• Fuzzy versus non-fuzzy
– In fuzzy clustering, a point belongs to every cluster with some
weight between 0 and 1
– Weights must sum to 1
– Probabilistic clustering has similar characteristics
• Partial versus complete
– In some cases, we only want to cluster some of the data
• Heterogeneous versus homogeneous
– Cluster of widely different sizes, shapes, and densities

23
Types of Clusters
• Well-separated clusters

• Center-based clusters

• Contiguous clusters

• Density-based clusters

• Property or Conceptual

• Described by an Objective Function


24
Types of Clusters: Well-Separated
• Well-Separated Clusters:
– A cluster is a set of points such that any point in a cluster is
closer (or more similar) to every other point in the cluster than
to any point not in the cluster.

3 well-separated clusters
25
Types of Clusters: Center-Based

• Center-based
– A cluster is a set of objects such that an object in a cluster is
closer (more similar) to the “center” of a cluster, than to the
center of any other cluster
– The center of a cluster is often a centroid, the average of all
the points in the cluster, or a medoid, the most
“representative” point of a cluster

4 center-based clusters
26
Types of Clusters: Contiguity-Based

• Contiguous Cluster (Nearest neighbor or


Transitive)
– A cluster is a set of points such that a point in a cluster is
closer (or more similar) to one or more other points in the
cluster than to any point not in the cluster.

8 contiguous clusters
27
Types of Clusters: Density-Based

• Density-based
– A cluster is a dense region of points, which is separated by
low-density regions, from other regions of high density.
– Used when the clusters are irregular or intertwined, and when
noise and outliers are present.

6 density-based clusters
28
Types of Clusters: Conceptual Clusters

• Shared Property or Conceptual Clusters


– Finds clusters that share some common property or represent
a particular concept.
.

2 Overlapping Circles
29
Major Clustering Approaches

• Partitioning algorithms: Construct various partitions and


then evaluate them by some criterion
• Hierarchy algorithms: Create a hierarchical decomposition
of the set of data (or objects) using some criterion
• Density-based: based on connectivity and density functions
• Grid-based: based on a multiple-level granularity structure
• Model-based: A model is hypothesized for each of the clusters and the idea is to find the best fit of the data to the given model
30
K-means Clustering
• Partitional clustering approach
• Each cluster is associated with a centroid (center point)
• Each point is assigned to the cluster with the closest
centroid
• Number of clusters, K, must be specified
• The basic algorithm is very simple

31
K-means Clustering – Details
• Initial centroids are often chosen randomly.
– Clusters produced vary from one run to another.
• The centroid is (typically) the mean of the points in the
cluster.
• ‘Closeness’ is measured by Euclidean distance, cosine
similarity, correlation, etc.
• K-means will converge for common similarity measures
mentioned above.
• Most of the convergence happens in the first few
iterations.
– Often the stopping condition is changed to ‘Until relatively few
points change clusters’
• Complexity is O( n * K * I * d )
– n = number of points, K = number of clusters,
I = number of iterations, d = number of attributes
32
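A compact NumPy sketch of this basic algorithm (random initial centroids; the toy data and K are assumptions, and empty clusters are not handled; see "Handling Empty Clusters" below):

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    # Assign each point to the closest centroid, then recompute each
    # centroid as the mean of its points, until the centroids stabilize.
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, centroids):   # converged
            break
        centroids = new
    return labels, centroids

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(m, 0.3, size=(50, 2))
               for m in ((0, 0), (3, 3), (0, 3))])
labels, centroids = kmeans(X, k=3)
print(centroids.round(2))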
Two different K-means Clusterings

(Figure: one set of original points and two K-means solutions for it: an optimal clustering and a sub-optimal clustering)
33
• Importance of choosing initial centroids
Evaluating K-means Clusters

• Most common measure is Sum of Squared Error (SSE)
  – For each point, the error is the distance to the nearest cluster
  – To get SSE, we square these errors and sum them:

    SSE = Σ_{i=1}^{K} Σ_{x ∈ C_i} dist²(m_i, x)

  – x is a data point in cluster C_i and m_i is the representative point for cluster C_i
    • can show that m_i corresponds to the center (mean) of the cluster
  – Given two clusterings, we can choose the one with the smallest error
  – One easy way to reduce SSE is to increase K, the number of clusters
    • A good clustering with smaller K can have a lower SSE than a poor clustering with higher K
34
Solutions to Initial Centroids Problem
• Multiple runs
– Helps, but probability is not on your side
• Sample and use hierarchical clustering to determine
initial centroids
• Select more than k initial centroids and then select
among these initial centroids
– Select most widely separated
• Postprocessing
• Bisecting K-means
– Not as susceptible to initialization issues

35
Handling Empty Clusters

• Basic K-means algorithm can yield empty


clusters

• Several strategies
– Choose the point that contributes most to SSE
– Choose a point from the cluster with the highest SSE
– If there are several empty clusters, the above can be
repeated several times.

36
Pre-processing and Post-processing

• Pre-processing
– Normalize the data
– Eliminate outliers

• Post-processing
– Eliminate small clusters that may represent outliers
– Split ‘loose’ clusters, i.e., clusters with relatively high
SSE
– Merge clusters that are ‘close’ and that have relatively
low SSE
– Can use these steps during the clustering process
• ISODATA 37
Bisecting K-means

• Bisecting K-means algorithm


– Variant of K-means that can produce a partitional or a
hierarchical clustering

38
Bisecting K-means Example

39
Limitations of K-means

• K-means has problems when clusters are of


differing
– Sizes
– Densities
– Non-globular shapes

• K-means has problems when the data contains


outliers.

40
Limitations of K-means: Differing Sizes

Original Points K-means (3 Clusters)

41
Limitations of K-means: Differing Density

Original Points K-means (3 Clusters)

42
Limitations of K-means: Non-globular
Shapes

Original Points K-means (2 Clusters)

43
Overcoming K-means Limitations

Original Points K-means Clusters

One solution is to use many clusters.


Find parts of clusters, but need to put together.
44
Overcoming K-means Limitations

Original Points K-means Clusters

45
Variations of the K-Means Method
• A few variants of the k-means which differ in
– Selection of the initial k means
– Dissimilarity calculations
– Strategies to calculate cluster means
• Handling categorical data: k-modes (Huang’98)
– Replacing means of clusters with modes
– Using new dissimilarity measures to deal with categorical objects
– Using a frequency-based method to update modes of clusters
• Handling a mixture of categorical and numerical data: k-
prototype method

46
The K-Medoids Clustering Method

• Find representative objects, called medoids, in clusters


• PAM (Partitioning Around Medoids, 1987)
– starts from an initial set of medoids and iteratively replaces one of
the medoids by one of the non-medoids if it improves the total
distance of the resulting clustering
– PAM works effectively for small data sets, but does not scale well
for large data sets
• CLARA (Kaufmann & Rousseeuw, 1990)
– draws multiple samples of the data set, applies PAM on each
sample, and gives the best clustering as the output
• CLARANS (Ng & Han, 1994): Randomized sampling
• Focusing + spatial data structure (Ester et al., 1995) 47
Hierarchical Clustering
• Produces a set of nested clusters organized as a hierarchical tree
• Can be visualized as a dendrogram
  – A tree-like diagram that records the sequences of merges or splits

(Figure: six points and the dendrogram recording the order in which they are merged, with merge heights on the vertical axis)
48
Strengths of Hierarchical Clustering

• Do not have to assume any particular number of


clusters
– Any desired number of clusters can be obtained by 'cutting' the dendrogram at the proper level

• They may correspond to meaningful taxonomies


– Example in biological sciences (e.g., animal kingdom,
phylogeny reconstruction, …)

49
Hierarchical Clustering

• Two main types of hierarchical clustering


– Agglomerative:
• Start with the points as individual clusters
• At each step, merge the closest pair of clusters until only one
cluster (or k clusters) left

– Divisive:
• Start with one, all-inclusive cluster
• At each step, split a cluster until each cluster contains a point (or
there are k clusters)

• Traditional hierarchical algorithms use a similarity or


distance matrix
– Merge or split one cluster at a time
50
Agglomerative Clustering Algorithm

• More popular hierarchical clustering technique

• Basic algorithm is straightforward (a sketch follows)
  1. Compute the proximity matrix
  2. Let each data point be a cluster
  3. Repeat
  4.   Merge the two closest clusters
  5.   Update the proximity matrix
  6. Until only a single cluster remains

• Key operation is the computation of the proximity of two clusters
  – Different approaches to defining the distance between clusters distinguish the different algorithms
51
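A naive single-link (MIN) version of this loop (O(N^3); a sketch assuming SciPy is available, not an efficient implementation):

import numpy as np
from scipy.spatial.distance import cdist

def single_link(X, k):
    # Agglomerative clustering with MIN (single-link) proximity:
    # repeatedly merge the two clusters whose closest points are closest.
    prox = cdist(X, X)                        # step 1: proximity matrix
    clusters = [[i] for i in range(len(X))]   # step 2: one cluster per point
    while len(clusters) > k:                  # steps 3-6
        best, pair = np.inf, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = prox[np.ix_(clusters[a], clusters[b])].min()  # MIN link
                if d < best:
                    best, pair = d, (a, b)
        a, b = pair
        clusters[a] += clusters.pop(b)        # merge the two closest clusters
    return clusters

X = np.array([[0, 0], [0.1, 0.2], [4, 4], [4.1, 3.9], [8, 0]])
print(single_link(X, k=3))   # [[0, 1], [2, 3], [4]]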
Starting Situation
• Start with clusters of individual points and a proximity matrix

(Figure: points p1, p2, …, p12, each its own cluster, and the full point-to-point proximity matrix)
52
Intermediate Situation
• After some merging steps, we have some clusters

(Figure: clusters C1 … C5 and the cluster-to-cluster proximity matrix)
53
Intermediate Situation
• We want to merge the two closest clusters (C2 and C5) and update the proximity matrix.

(Figure: clusters C1 … C5 with C2 and C5 about to be merged)
54
After Merging
• The question is "How do we update the proximity matrix?"

(Figure: the proximity matrix with the row and column for the merged cluster "C2 U C5" marked "?" against C1, C3, and C4)
55
How to Define Inter-Cluster Similarity

(Figure: two clusters of points and the proximity matrix; which entries define the similarity between the two clusters?)

• MIN
• MAX
• Group Average
• Distance Between Centroids
• Other methods driven by an objective function
  – Ward's Method uses squared error
56
Hierarchical Clustering: Comparison

(Figure: the same six points clustered by MIN, MAX, Group Average, and Ward's Method; each definition of inter-cluster proximity merges the points in a different order and produces differently shaped clusters)
61
Hierarchical Clustering: Time and Space Requirements
• O(N²) space since it uses the proximity matrix.
  – N is the number of points.

• O(N³) time in many cases
  – There are N steps, and at each step the proximity matrix, of size N², must be updated and searched
  – Complexity can be reduced to O(N² log(N)) time for some approaches
62
Hierarchical Clustering: Problems and
Limitations
• Once a decision is made to combine two
clusters, it cannot be undone

• No objective function is directly minimized

• Different schemes have problems with one or


more of the following:
– Sensitivity to noise and outliers
– Difficulty handling different sized clusters and convex
shapes
– Breaking large clusters
63
MST: Divisive Hierarchical Clustering
• Build MST (Minimum Spanning Tree)
– Start with a tree that consists of any point
– In successive steps, look for the closest pair of points (p, q)
such that one point (p) is in the current tree but the other (q) is
not
– Add q to the tree and put an edge between p and q

64
MST: Divisive Hierarchical Clustering

• Use MST for constructing hierarchy of clusters

65
More on Hierarchical Clustering Methods

• Major weakness of agglomerative clustering methods


– do not scale well: time complexity of at least O(n2), where n is the
number of total objects
– can never undo what was done previously
• Integration of hierarchical with distance-based clustering
– BIRCH (1996): uses CF-tree and incrementally adjusts the quality
of sub-clusters
– CURE (1998): selects well-scattered points from the cluster and
then shrinks them towards the center of the cluster by a specified
fraction
– CHAMELEON (1999): hierarchical clustering using dynamic
modeling

66
One Alternative: BIRCH

• Birch: Balanced Iterative Reducing and Clustering using


Hierarchies, by Zhang, Ramakrishnan, Livny (SIGMOD’96)
• Incrementally construct a CF (Clustering Feature) tree, a
hierarchical data structure for multiphase clustering
– Phase 1: scan DB to build an initial in-memory CF tree (a multi-level
compression of the data that tries to preserve the inherent clustering
structure of the data)
– Phase 2: use an arbitrary clustering algorithm to cluster the leaf
nodes of the CF-tree
• Scales linearly: finds a good clustering with a single scan
and improves the quality with a few additional scans
• Weakness: handles only numeric data, and sensitive to the
order of the data record.
67
Density-Based Clustering Methods

• Clustering based on density (local cluster criterion),


such as density-connected points
• Major features:
– Discover clusters of arbitrary shape
– Handle noise
– One scan
– Need density parameters as termination condition
• Several interesting studies:
– DBSCAN: Ester, et al. (KDD’96)
– OPTICS: Ankerst, et al (SIGMOD’99).
– DENCLUE: Hinneburg & D. Keim (KDD’98)
– CLIQUE: Agrawal, et al. (SIGMOD’98)

68
DBSCAN

• DBSCAN is a density-based algorithm.
• Definitions:
  – Density = number of points within a specified radius (Eps)

  – A point is a core point if it has more than a specified number of points (MinPts) within Eps
    • These are points that are at the interior of a cluster

  – A border point has fewer than MinPts within Eps, but is in the neighborhood of a core point

  – A noise point is any point that is not a core point or a border point.
69
DBSCAN: Core, Border, and Noise Points

70
DBSCAN Algorithm
• Eliminate noise points
• Perform clustering on the remaining points: connect core points that are within Eps of each other, and assign each border point to a cluster of one of its core points (a usage sketch follows)
71
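A usage sketch with scikit-learn's implementation (assuming scikit-learn is available; eps and min_samples correspond to Eps and MinPts):

import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(seed=0)
# Two dense blobs plus a few scattered points that should end up as noise.
X = np.vstack([rng.normal(0, 0.2, size=(50, 2)),
               rng.normal(3, 0.2, size=(50, 2)),
               rng.uniform(-2, 5, size=(5, 2))])

db = DBSCAN(eps=0.5, min_samples=4).fit(X)
print(np.unique(db.labels_))   # cluster ids; label -1 marks noise points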
DBSCAN: Core, Border and Noise Points

Original Points Point types: core,


border and noise

Eps = 10, MinPts = 4


72
When DBSCAN Works Well

Original Points Clusters

• Resistant to Noise
• Can handle clusters of different shapes and sizes
73
When DBSCAN Does NOT Work Well

(MinPts=4, Eps=9.75).

Original Points

• Varying densities
• High-dimensional data
(MinPts=4, Eps=9.92)
74
DBSCAN: Determining EPS and MinPts
• Idea is that for points in a cluster, their kth nearest
neighbors are at roughly the same distance
• Noise points have the kth nearest neighbor at farther
distance
• So, plot sorted distance of every point to its kth
nearest neighbor

75
Graph-Based Clustering

• Graph-Based clustering uses the proximity


graph
– Start with the proximity matrix
– Consider each point as a node in a graph
– Each edge between two nodes has a weight which is
the proximity between the two points
– Initially the proximity graph is fully connected
– MIN (single-link) and MAX (complete-link) can be
viewed as starting with this graph
• In the simplest case, clusters are connected
components in the graph.
76
Graph-Based Clustering: Sparsification

• Clustering may work better


– Sparsification techniques keep the connections to the most
similar (nearest) neighbors of a point while breaking the
connections to less similar points.
– The nearest neighbors of a point tend to belong to the same
class as the point itself.
– This reduces the impact of noise and outliers and sharpens
the distinction between clusters.

• Sparsification facilitates the use of graph partitioning algorithms (or algorithms based on graph partitioning algorithms).
– Chameleon and Hypergraph-based Clustering
77
Sparsification in the Clustering Process

78
Limitations of Current Merging
Schemes
(a)

(b)

(c)

(d)

Closeness schemes Average connectivity schemes


will merge (a) and (b) will merge (c) and (d)

79
Model-Based Clustering Methods

• Attempt to optimize the fit between the data and some


mathematical model
• Statistical and AI approach
– Conceptual clustering
• A form of clustering in machine learning
• Produces a classification scheme for a set of unlabeled objects
• Finds characteristic description for each concept (class)
– COBWEB (Fisher’87)
• A popular and simple method of incremental conceptual learning
• Creates a hierarchical clustering in the form of a classification tree
• Each node refers to a concept and contains a probabilistic description
of that concept

80
Cluster Validity
• For supervised classification we have a variety of
measures to evaluate how good our model is
– Accuracy, precision, recall

• For cluster analysis, the analogous question is how to


evaluate the “goodness” of the resulting clusters?

• But “clusters are in the eye of the beholder”!

• Then why do we want to evaluate them?


– To avoid finding patterns in noise
– To compare clustering algorithms
– To compare two sets of clusters
– To compare two clusters
81
Clusters found in Random Data

(Figure: one set of points distributed uniformly at random in the unit square, "clustered" by DBSCAN, K-means, and Complete Link; each algorithm reports clusters even though the data has no cluster structure)
82
Measures of Cluster Validity
• Numerical measures that are applied to judge various aspects
of cluster validity, are classified into the following three types.
– External Index: Used to measure the extent to which cluster labels
match externally supplied class labels.
• Entropy
– Internal Index: Used to measure the goodness of a clustering
structure without respect to external information.
• Sum of Squared Error (SSE)
– Relative Index: Used to compare two different clusterings or
clusters.
• Often an external or internal index is used for this function, e.g., SSE or
entropy
• Sometimes these are referred to as criteria instead of indices
– However, sometimes criterion is the general strategy and index is the
numerical measure that implements the criterion.
83
Internal Measures: Cohesion and Separation
• Cluster Cohesion: Measures how closely related the objects in a cluster are
  – Example: SSE
• Cluster Separation: Measures how distinct or well-separated a cluster is from other clusters
• Example: Squared Error
  – Cohesion is measured by the within-cluster sum of squares (SSE):

    WSS = Σ_i Σ_{x ∈ C_i} (x − m_i)²

  – Separation is measured by the between-cluster sum of squares:

    BSS = Σ_i |C_i| (m − m_i)²

    where |C_i| is the size of cluster i and m is the overall mean
84
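A quick sketch of both quantities; for any clustering, WSS + BSS equals the total sum of squares around the overall mean:

import numpy as np

def wss_bss(X, labels):
    m = X.mean(axis=0)                           # overall mean
    wss = bss = 0.0
    for c in np.unique(labels):
        pts = X[labels == c]
        mi = pts.mean(axis=0)                    # cluster mean
        wss += ((pts - mi) ** 2).sum()           # cohesion
        bss += len(pts) * ((m - mi) ** 2).sum()  # separation
    return wss, bss

X = np.array([[1.0], [2.0], [4.0], [5.0]])
labels = np.array([0, 0, 1, 1])
print(wss_bss(X, labels))   # (1.0, 9.0); total SS around the mean is 10.0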
External Measures of Cluster Validity:
Entropy and Purity

85
Final Comment on Cluster Validity

“The validation of clustering structures is the most


difficult and frustrating part of cluster analysis.
Without a strong effort in this direction, cluster
analysis will remain a black art accessible only to
those true believers who have experience and
great courage.”

Algorithms for Clustering Data, Jain and Dubes

86
What Is Outlier Discovery?

• What are outliers?
  – A set of objects that are considerably dissimilar from the remainder of the data
  – Example: Sports: Michael Jordan, Wayne Gretzky, ...
• Problem
– Find top n outlier points
• Applications:
– Credit card fraud detection
– Telecom fraud detection
– Customer segmentation
– Medical analysis

87
Outlier Discovery:
Statistical Approach

Assume a model of the underlying distribution that generates the data set (e.g. normal distribution)
• Use discordancy tests depending on
– data distribution
– distribution parameter (e.g., mean, variance)
– number of expected outliers
• Drawbacks
– most tests are for single attribute
– In many cases, data distribution may not be known
88
Outlier Discovery: Distance-Based
Approach

• Introduced to counter the main limitations imposed by


statistical methods
– We need multi-dimensional analysis without knowing data
distribution.
• Distance-based outlier: outlier is an object O in a dataset
T such that at least a fraction p of the objects in T lies at
a distance greater than D from O
• Algorithms for mining distance-based outliers
– Index-based algorithm
– Nested-loop algorithm
– Cell-based algorithm

89
Outlier Discovery: Deviation-Based
Approach
• Identifies outliers by examining the main characteristics of objects in a group
• Objects that “deviate” from this description are
considered outliers
• sequential exception technique
– simulates the way in which humans can distinguish unusual
objects from among a series of supposedly like objects
• OLAP data cube technique
– uses data cubes to identify regions of anomalies in large
multidimensional data

90
Fall 2004, CIS, Temple University

CIS527: Data Warehousing, Filtering, and Mining

Lecture 7

Decision Trees

Lecture slides taken from:
– Vipin Kumar (http://www-users.cs.umn.edu/~kumar/csci5980/index.html)
Classification: Definition

• Given a collection of records (training set)
  – Each record contains a set of attributes; one of the attributes is the class.
• Find a model for the class attribute as a function of the values of other attributes.
• Goal: previously unseen records should be assigned a class as accurately as possible.
  – A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
Illustrating Classification Task

Training Set:
Tid Attrib1 Attrib2 Attrib3 Class
1   Yes     Large   125K    No
2   No      Medium  100K    No
3   No      Small   70K     No
4   Yes     Medium  120K    No
5   No      Large   95K     Yes
6   No      Medium  60K     No
7   Yes     Large   220K    No
8   No      Small   85K     Yes
9   No      Medium  75K     No
10  No      Small   90K     Yes

Test Set:
Tid Attrib1 Attrib2 Attrib3 Class
11  No      Small   55K     ?
12  Yes     Medium  80K     ?
13  Yes     Large   110K    ?
14  No      Small   95K     ?
15  No      Large   67K     ?

Induction: a learning algorithm is applied to the training set to learn a model.
Deduction: the model is applied to the test set to assign a class to each record.
Examples of Classification Task

• Predicting tumor cells as benign or malignant

• Classifying credit card transactions as legitimate or fraudulent

• Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil

• Categorizing news stories as finance, weather, entertainment, sports, etc.
Classification Techniques

• Decision Tree based Methods
• Rule-based Methods
• Memory based reasoning
• Neural Networks
• Naïve Bayes and Bayesian Belief Networks
• Support Vector Machines
Example of a Decision Tree

Training Data:
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Model: Decision Tree (internal nodes are the splitting attributes):
Refund?
├─ Yes → NO
└─ No  → MarSt?
         ├─ Single, Divorced → TaxInc?
         │   ├─ < 80K → NO
         │   └─ > 80K → YES
         └─ Married → NO

© Vipin Kumar CSci 5980 Spring 2004 ‹#›


Another Example of Decision Tree

Using the same training data, an alternative tree:
MarSt?
├─ Married → NO
└─ Single, Divorced → Refund?
    ├─ Yes → NO
    └─ No  → TaxInc?
        ├─ < 80K → NO
        └─ > 80K → YES

There could be more than one tree that
fits the same data!

© Vipin Kumar CSci 5980 Spring 2004 ‹#›


Decision Tree Classification Task

The general classification flow, instantiated for decision trees:
Training Set (Tids 1-10) → Tree Induction algorithm (Induction) → Learn Model
→ Model: Decision Tree;
Decision Tree → Apply Model to Test Set (Tids 11-15) (Deduction).

© Vipin Kumar CSci 5980 Spring 2004 ‹#›


Apply Model to Test Data

Test Data: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?

Start from the root of the tree and follow the branches that match the record:

Refund?
├─ Yes → NO
└─ No  → MarSt?
         ├─ Single, Divorced → TaxInc?
         │   ├─ < 80K → NO
         │   └─ > 80K → YES
         └─ Married → NO

Traversal: Refund = No → MarSt = Married → Assign Cheat to “No”

© Vipin Kumar CSci 5980 Spring 2004 ‹#›


Decision Tree Classification Task (recap)

As before: the training set feeds the tree induction algorithm (induction)
to learn a decision tree; the decision tree is then applied to the test set
(deduction).

© Vipin Kumar CSci 5980 Spring 2004 ‹#›


Decision Tree Induction

Many Algorithms:
– Hunt’s Algorithm (one of the earliest)
– CART
– ID3, C4.5
– SLIQ, SPRINT

© Vipin Kumar CSci 5980 Spring 2004 ‹#›


General Structure of Hunt’s Algorithm

Let Dt be the set of training records that reach a node t
(illustrated on the Tid 1-10 training data above).

General Procedure:
– If Dt contains records that all belong to the same class yt,
then t is a leaf node labeled as yt
– If Dt is an empty set, then t is a leaf node labeled by the
default class, yd
– If Dt contains records that belong to more than one class,
use an attribute test to split the data into smaller subsets.
Recursively apply the procedure to each subset (a sketch of
this recursion follows below).

© Vipin Kumar CSci 5980 Spring 2004 ‹#›
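The recursive procedure above can be sketched directly in code. The following is an illustrative Python rendering, not code from the slides; for simplicity it splits on the first attribute that still has more than one distinct value, whereas the next slides replace that naive choice with an impurity-based test.

from collections import Counter

def hunt(records, attributes, default_class):
    # records: list of dicts with attribute values and a "class" key (Dt)
    if not records:                          # Dt empty -> leaf labeled with default class yd
        return default_class
    labels = [r["class"] for r in records]
    if len(set(labels)) == 1:                # all records in the same class yt -> leaf yt
        return labels[0]
    majority = Counter(labels).most_common(1)[0][0]
    for a in attributes:                     # choose an attribute test (naive choice here)
        values = sorted(set(r[a] for r in records))
        if len(values) > 1:
            rest = [x for x in attributes if x != a]
            # recursively apply the procedure to each subset
            return {a: {v: hunt([r for r in records if r[a] == v], rest, majority)
                        for v in values}}
    return majority                          # no attribute separates the records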


Hunt’s Algorithm (on the Tid 1-10 training data)

Step 1: a single leaf predicting the majority class: Don’t Cheat.
Step 2: split on Refund:
  Yes → Don’t Cheat;  No → Don’t Cheat.
Step 3: under Refund = No, split on Marital Status:
  Married → Don’t Cheat;  Single, Divorced → Cheat.
Step 4: under Single, Divorced, split on Taxable Income:
  < 80K → Don’t Cheat;  >= 80K → Cheat.
© Vipin Kumar CSci 5980 Spring 2004 ‹#›
Tree Induction

Greedy strategy.
– Split the records based on an attribute test
that optimizes a certain criterion.

Issues
– Determine how to split the records
◆How to specify the attribute test condition?
◆How to determine the best split?

– Determine when to stop splitting

© Vipin Kumar CSci 5980 Spring 2004 ‹#›


How to Specify Test Condition?

Depends on attribute types


– Nominal
– Ordinal
– Continuous

Depends on number of ways to split


– 2-way split
– Multi-way split

© Vipin Kumar CSci 5980 Spring 2004 ‹#›


Splitting Based on Nominal Attributes

Multi-way split: use as many partitions as distinct values:
  CarType → {Family}, {Sports}, {Luxury}

Binary split: divides values into two subsets;
need to find the optimal partitioning:
  {Sports, Luxury} vs {Family}   OR   {Family, Luxury} vs {Sports}

© Vipin Kumar CSci 5980 Spring 2004 ‹#›


Splitting Based on Ordinal Attributes

Multi-way split: use as many partitions as distinct values:
  Size → {Small}, {Medium}, {Large}

Binary split: divides values into two subsets;
need to find the optimal partitioning:
  {Small, Medium} vs {Large}   OR   {Medium, Large} vs {Small}

What about the split {Small, Large} vs {Medium}?
It violates the order property of the attribute, so it is not
used for ordinal attributes.

© Vipin Kumar CSci 5980 Spring 2004 ‹#›


Splitting Based on Continuous Attributes

Different ways of handling


– Discretization to form an ordinal categorical
attribute
◆ Static – discretize once at the beginning
◆ Dynamic – ranges can be found by equal interval
bucketing, equal frequency bucketing
(percentiles), or clustering.

– Binary Decision: (A < v) or (A ≥ v)
◆ consider all possible splits and find the best cut
◆ can be more compute intensive

© Vipin Kumar CSci 5980 Spring 2004 ‹#›


Splitting Based on Continuous Attributes

(i) Binary split: Taxable Income > 80K? → Yes / No
(ii) Multi-way split: Taxable Income? → < 10K, [10K,25K), [25K,50K), [50K,80K), > 80K

© Vipin Kumar CSci 5980 Spring 2004 ‹#›


Tree Induction

Greedy strategy.
– Split the records based on an attribute test
that optimizes a certain criterion.

Issues
– Determine how to split the records
◆How to specify the attribute test condition?
◆How to determine the best split?

– Determine when to stop splitting

© Vipin Kumar CSci 5980 Spring 2004 ‹#›


How to determine the Best Split

Before splitting: 10 records of class 0,
10 records of class 1.

Candidate splits:
Own Car?     Yes: C0=6, C1=4          No: C0=4, C1=6
Car Type?    Family: C0=1, C1=3       Sports: C0=8, C1=0      Luxury: C0=1, C1=7
Student ID?  c1..c10: C0=1, C1=0 each     c11..c20: C0=0, C1=1 each

Which test condition is the best?

© Vipin Kumar CSci 5980 Spring 2004 ‹#›


How to determine the Best Split

Greedy approach:
– Nodes with homogeneous class distribution
are preferred
Need a measure of node impurity:

C0: 5, C1: 5  (non-homogeneous, high degree of impurity)
C0: 9, C1: 1  (homogeneous, low degree of impurity)

© Vipin Kumar CSci 5980 Spring 2004 ‹#›


Measures of Node Impurity

Gini Index

Entropy

Misclassification error

© Vipin Kumar CSci 5980 Spring 2004 ‹#›


Measure of Impurity: GINI

Gini Index for a given node t:

GINI(t) = 1 − Σ_j [p(j|t)]²

(NOTE: p(j|t) is the relative frequency of class j at node t.)

– Maximum (1 − 1/nc) when records are equally
distributed among all classes, implying least
interesting information
– Minimum (0.0) when all records belong to one class,
implying most interesting information

C1  0       C1  1       C1  2       C1  3
C2  6       C2  5       C2  4       C2  3
Gini=0.000  Gini=0.278  Gini=0.444  Gini=0.500

© Vipin Kumar CSci 5980 Spring 2004 ‹#›


Examples for computing GINI

GINI(t) = 1 − Σ_j [p(j|t)]²

C1 0, C2 6:  P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
             Gini = 1 − P(C1)² − P(C2)² = 1 − 0 − 1 = 0

C1 1, C2 5:  P(C1) = 1/6, P(C2) = 5/6
             Gini = 1 − (1/6)² − (5/6)² = 0.278

C1 2, C2 4:  P(C1) = 2/6, P(C2) = 4/6
             Gini = 1 − (2/6)² − (4/6)² = 0.444

© Vipin Kumar CSci 5980 Spring 2004 ‹#›
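A one-line helper reproduces these numbers; this is an illustrative sketch, not code from the slides.

def gini(counts):
    # counts: number of records of each class at the node
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

print(gini([0, 6]), gini([1, 5]), gini([2, 4]))   # 0.0, ~0.278, ~0.444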


Splitting Based on GINI

Used in CART, SLIQ, SPRINT.
When a node p is split into k partitions (children), the
quality of the split is computed as

GINI_split = Σ_{i=1..k} (n_i / n) · GINI(i)

where n_i = number of records at child i, and
n = number of records at node p.

© Vipin Kumar CSci 5980 Spring 2004 ‹#›


Binary Attributes: Computing GINI
Index

Splits into two partitions.
Effect of weighting partitions: larger and purer partitions are sought.

Parent: C1 = 6, C2 = 6, Gini = 0.500
Split on B?: Node N1 (Yes), Node N2 (No)

N1: C1 = 5, C2 = 2        N2: C1 = 1, C2 = 4
Gini(N1) = 1 − (5/7)² − (2/7)² = 0.408
Gini(N2) = 1 − (1/5)² − (4/5)² = 0.320
Gini(Children) = 7/12 × 0.408 + 5/12 × 0.320 = 0.371
© Vipin Kumar CSci 5980 Spring 2004 ‹#›
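The weighted computation above can be checked with a short sketch (illustrative, not from the slides):

def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_split(children):
    # children: one class-count list per partition, e.g. N1 = [5, 2], N2 = [1, 4]
    n = sum(sum(c) for c in children)
    return sum(sum(c) / n * gini(c) for c in children)

print(gini_split([[5, 2], [1, 4]]))   # ~0.371 for the B? split above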
Categorical Attributes: Computing Gini Index

For each distinct value, gather counts for each class in the dataset.
Use the count matrix to make decisions.

Multi-way split:
CarType   Family  Sports  Luxury
C1        1       2       1
C2        4       1       1
Gini = 0.393

Two-way split (find the best partition of values):
CarType   {Sports, Luxury}  {Family}        CarType   {Family, Luxury}  {Sports}
C1        3                 1               C1        2                 2
C2        2                 4               C2        1                 5
Gini = 0.400                                Gini = 0.419

© Vipin Kumar CSci 5980 Spring 2004 ‹#›


Continuous Attributes: Computing Gini Index

Use binary decisions based on one value (e.g., Taxable Income > 80K? Yes/No,
on the Tid 1-10 training data above).
Several choices for the splitting value:
– Number of possible splitting values = number of distinct values
Each splitting value v has a count matrix associated with it:
– Class counts in each of the partitions, A < v and A ≥ v
Simple method to choose the best v:
– For each v, scan the database to gather the count matrix and
compute its Gini index
– Computationally inefficient! Repetition of work.

© Vipin Kumar CSci 5980 Spring 2004 ‹#›


Continuous Attributes: Computing Gini Index...

For efficient computation: for each attribute,
– Sort the attribute on values
– Linearly scan these values, each time updating the count matrix
and computing the Gini index
– Choose the split position that has the least Gini index

Cheat:          No   No   No   Yes  Yes  Yes  No   No   No   No
Sorted values:  60   70   75   85   90   95   100  120  125  220

Split position  55    65    72    80    87    92    97    110   122   172   230
Yes (<=, >)     0,3   0,3   0,3   0,3   1,2   2,1   3,0   3,0   3,0   3,0   3,0
No  (<=, >)     0,7   1,6   2,5   3,4   3,4   3,4   3,4   4,3   5,2   6,1   7,0
Gini            0.420 0.400 0.375 0.343 0.417 0.400 0.300 0.343 0.375 0.400 0.420

The best split is at position 97, with the lowest Gini (0.300).

© Vipin Kumar CSci 5980 Spring 2004 ‹#›
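A sketch of this sort-and-scan procedure on the Taxable Income column (illustrative code, not from the slides; split positions are taken as midpoints between consecutive sorted values):

def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def best_split(values, labels):
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    classes = sorted(set(labels))
    left = {c: 0 for c in classes}
    total = {c: labels.count(c) for c in classes}
    best_v, best_g = None, float("inf")
    for i in range(n - 1):                    # sweep split positions left to right
        left[pairs[i][1]] += 1                # move one record to the "<= v" side
        v = (pairs[i][0] + pairs[i + 1][0]) / 2
        lc = [left[c] for c in classes]
        rc = [total[c] - left[c] for c in classes]
        g = (i + 1) / n * gini(lc) + (n - i - 1) / n * gini(rc)
        if g < best_g:
            best_v, best_g = v, g
    return best_v, best_g

income = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
cheat = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]
print(best_split(income, cheat))              # splits near 97 with Gini ~0.300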


Alternative Splitting Criteria based on INFO

Entropy at a given node t:

Entropy(t) = − Σ_j p(j|t) log₂ p(j|t)

(NOTE: p(j|t) is the relative frequency of class j at node t.)

– Measures homogeneity of a node:
◆ Maximum (log₂ nc) when records are equally distributed
among all classes, implying least information
◆ Minimum (0.0) when all records belong to one class,
implying most information
– Entropy-based computations are similar to the
GINI index computations
© Vipin Kumar CSci 5980 Spring 2004 ‹#›
Examples for computing Entropy

Entropy(t) = − Σ_j p(j|t) log₂ p(j|t)

C1 0, C2 6:  P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
             Entropy = − 0 log₂ 0 − 1 log₂ 1 = − 0 − 0 = 0

C1 1, C2 5:  P(C1) = 1/6, P(C2) = 5/6
             Entropy = − (1/6) log₂ (1/6) − (5/6) log₂ (5/6) = 0.65

C1 2, C2 4:  P(C1) = 2/6, P(C2) = 4/6
             Entropy = − (2/6) log₂ (2/6) − (4/6) log₂ (4/6) = 0.92

© Vipin Kumar CSci 5980 Spring 2004 ‹#›


Splitting Based on INFO...

Information Gain:

GAIN_split = Entropy(p) − Σ_{i=1..k} (n_i / n) · Entropy(i)

Parent node p is split into k partitions;
n_i is the number of records in partition i.
– Measures the reduction in entropy achieved because of
the split. Choose the split that achieves the most reduction
(maximizes GAIN).
– Used in ID3 and C4.5
– Disadvantage: tends to prefer splits that result in a large
number of partitions, each being small but pure.

© Vipin Kumar CSci 5980 Spring 2004 ‹#›
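The entropy examples and the gain formula can be sketched as follows (illustrative code, not from the slides):

import math

def entropy(counts):
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

def information_gain(parent, children):
    # parent: class counts before the split; children: class counts per partition
    n = sum(parent)
    return entropy(parent) - sum(sum(c) / n * entropy(c) for c in children)

print(entropy([1, 5]), entropy([2, 4]))              # ~0.65, ~0.92
print(information_gain([6, 6], [[5, 2], [1, 4]]))    # gain of the earlier B? split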


Splitting Based on INFO...

Gain Ratio:

GainRATIO_split = GAIN_split / SplitINFO,
where SplitINFO = − Σ_{i=1..k} (n_i / n) log(n_i / n)

Parent node p is split into k partitions;
n_i is the number of records in partition i.
– Adjusts Information Gain by the entropy of the
partitioning (SplitINFO). Higher-entropy partitioning
(a large number of small partitions) is penalized!
– Used in C4.5
– Designed to overcome the disadvantage of Information Gain
© Vipin Kumar CSci 5980 Spring 2004 ‹#›
Splitting Criteria based on Classification Error

Classification error at a node t:

Error(t) = 1 − max_i P(i|t)

Measures the misclassification error made by a node:
◆ Maximum (1 − 1/nc) when records are equally distributed
among all classes, implying least interesting information
◆ Minimum (0.0) when all records belong to one class, implying
most interesting information

© Vipin Kumar CSci 5980 Spring 2004 ‹#›


Examples for Computing Error

Error(t) = 1 − max_i P(i|t)

C1 0, C2 6:  P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
             Error = 1 − max(0, 1) = 1 − 1 = 0

C1 1, C2 5:  P(C1) = 1/6, P(C2) = 5/6
             Error = 1 − max(1/6, 5/6) = 1 − 5/6 = 1/6

C1 2, C2 4:  P(C1) = 2/6, P(C2) = 4/6
             Error = 1 − max(2/6, 4/6) = 1 − 4/6 = 1/3

© Vipin Kumar CSci 5980 Spring 2004 ‹#›


Comparison among Splitting Criteria

For a 2-class problem:
[Figure: Gini index, Entropy, and Misclassification error plotted as functions
of p, the fraction of records belonging to one of the two classes.]

© Vipin Kumar CSci 5980 Spring 2004 ‹#›


Tree Induction

Greedy strategy.
– Split the records based on an attribute test
that optimizes a certain criterion.

Issues
– Determine how to split the records
◆How to specify the attribute test condition?
◆How to determine the best split?

– Determine when to stop splitting

© Vipin Kumar CSci 5980 Spring 2004 ‹#›


Stopping Criteria for Tree Induction

Stop expanding a node when all the records


belong to the same class

Stop expanding a node when all the records have


similar attribute values

Early termination (to be discussed later)

© Vipin Kumar CSci 5980 Spring 2004 ‹#›


Decision Tree Based Classification

Advantages:
– Inexpensive to construct
– Extremely fast at classifying unknown records
– Easy to interpret for small-sized trees
– Accuracy is comparable to other classification
techniques for many simple data sets

© Vipin Kumar CSci 5980 Spring 2004 ‹#›


Example: C4.5

Simple depth-first construction.
Uses Information Gain.
Sorts continuous attributes at each node.
Needs the entire data set to fit in memory.
Unsuitable for large data sets:
– needs out-of-core sorting.

You can download the software from:


http://www.cse.unsw.edu.au/~quinlan/c4.5r8.tar.gz

© Vipin Kumar CSci 5980 Spring 2004 ‹#›


Practical Issues of Classification

Underfitting and Overfitting

Missing Values

Costs of Classification

© Vipin Kumar CSci 5980 Spring 2004 ‹#›


Underfitting and Overfitting (Example)

500 circular and 500
triangular data points.

Circular points:
0.5 ≤ sqrt(x1² + x2²) ≤ 1

Triangular points:
sqrt(x1² + x2²) > 1 or
sqrt(x1² + x2²) < 0.5

© Vipin Kumar CSci 5980 Spring 2004 ‹#›


Underfitting and Overfitting

[Figure: training and test error vs. number of nodes; as the tree grows,
training error keeps decreasing while test error starts increasing (overfitting).]

Underfitting: when the model is too simple, both training and test errors are large

© Vipin Kumar CSci 5980 Spring 2004 ‹#›


Overfitting due to Noise

Decision boundary is distorted by a noise point

© Vipin Kumar CSci 5980 Spring 2004 ‹#›


Overfitting due to Insufficient Examples

Lack of data points in the lower half of the diagram makes it difficult
to predict correctly the class labels of that region.
– An insufficient number of training records in the region causes the
decision tree to predict the test examples using other training
records that are irrelevant to the classification task.
© Vipin Kumar CSci 5980 Spring 2004 ‹#›
Notes on Overfitting

Overfitting results in decision trees that are more


complex than necessary

Training error no longer provides a good estimate


of how well the tree will perform on previously
unseen records

Need new ways for estimating errors

© Vipin Kumar CSci 5980 Spring 2004 ‹#›


Estimating Generalization Errors

Re-substitution errors: error on training ( e(t) )
Generalization errors: error on testing ( e’(t) )
Methods for estimating generalization errors:
– Optimistic approach: e’(t) = e(t)
– Pessimistic approach:
◆ For each leaf node: e’(t) = e(t) + 0.5
◆ Total errors: e’(T) = e(T) + N × 0.5 (N: number of leaf nodes)
◆ For a tree with 30 leaf nodes and 10 errors on training
(out of 1000 instances):
Training error = 10/1000 = 1%
Generalization error = (10 + 30 × 0.5)/1000 = 2.5%
– Reduced error pruning (REP):
◆ uses a validation data set to estimate the generalization
error
© Vipin Kumar CSci 5980 Spring 2004 ‹#›


Occam’s Razor

Given two models with similar generalization errors,
one should prefer the simpler model over the
more complex model.

For complex models, there is a greater chance
that the model was fitted accidentally to errors in the data.

Therefore, one should include model complexity
when evaluating a model.

© Vipin Kumar CSci 5980 Spring 2004 ‹#›


How to Address Overfitting

Pre-Pruning (Early Stopping Rule)


– Stop the algorithm before it becomes a fully-grown tree
– Typical stopping conditions for a node:
◆ Stop if all instances belong to the same class
◆ Stop if all the attribute values are the same
– More restrictive conditions:
◆ Stop if the number of instances is less than some user-specified
threshold
◆ Stop if the class distribution of instances is independent of the
available features (e.g., using a χ² test)
◆ Stop if expanding the current node does not improve impurity
measures (e.g., Gini or information gain).

© Vipin Kumar CSci 5980 Spring 2004 ‹#›


How to Address Overfitting…

Post-pruning
– Grow decision tree to its entirety
– Trim the nodes of the decision tree in a
bottom-up fashion
– If generalization error improves after trimming,
replace sub-tree by a leaf node.
– Class label of leaf node is determined from
majority class of instances in the sub-tree
– Can use MDL for post-pruning

© Vipin Kumar CSci 5980 Spring 2004 ‹#›


Example of Post-Pruning
Before splitting: Class = Yes: 20, Class = No: 10
Training error (before splitting) = 10/30
Pessimistic error = (10 + 0.5)/30 = 10.5/30

After splitting on A into A1, A2, A3, A4:
A1: Class = Yes 8, Class = No 4
A2: Class = Yes 3, Class = No 4
A3: Class = Yes 4, Class = No 1
A4: Class = Yes 5, Class = No 1

Training error (after splitting) = 9/30
Pessimistic error (after splitting) = (9 + 4 × 0.5)/30 = 11/30
11/30 > 10.5/30 → PRUNE!

© Vipin Kumar CSci 5980 Spring 2004 ‹#›
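The pruning decision above reduces to a comparison of pessimistic error estimates; a minimal sketch (illustrative, not from the slides):

def pessimistic_error(train_errors, num_leaves, n):
    # e'(T) = e(T) + 0.5 per leaf, expressed as a rate over n training records
    return (train_errors + 0.5 * num_leaves) / n

before = pessimistic_error(10, 1, 30)    # 10.5/30, single leaf
after = pessimistic_error(9, 4, 30)      # 11/30, after splitting into A1..A4
print("PRUNE!" if after >= before else "keep the split")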


Examples of Post-pruning

Case 1:  left node C0: 11, C1: 3;   right node C0: 2, C1: 4
Case 2:  left node C0: 14, C1: 3;   right node C0: 2, C1: 2

– Optimistic error?  Don’t prune for both cases
– Pessimistic error?  Don’t prune case 1, prune case 2
– Reduced error pruning?  Depends on the validation set

© Vipin Kumar CSci 5980 Spring 2004 ‹#›


Expressiveness

Decision trees provide an expressive representation for
learning discrete-valued functions
– But they do not generalize well to certain types of
Boolean functions
◆ Example: parity function:
– Class = 1 if there is an even number of Boolean attributes with truth
value = True
– Class = 0 if there is an odd number of Boolean attributes with truth
value = True
◆ For accurate modeling, must have a complete tree

Not expressive enough for modeling continuous variables
– Particularly when the test condition involves only a single
attribute at a time

© Vipin Kumar CSci 5980 Spring 2004 ‹#›


Decision Boundary
x < 0.43?
├─ Yes → y < 0.47?
│        ├─ Yes → class 1 (counts 4 : 0)
│        └─ No  → class 2 (counts 0 : 4)
└─ No  → y < 0.33?
         ├─ Yes → class 2 (counts 0 : 3)
         └─ No  → class 1 (counts 4 : 0)

• The border line between two neighboring regions of different classes is
known as the decision boundary
• The decision boundary is parallel to the axes because each test condition
involves a single attribute at a time

© Vipin Kumar CSci 5980 Spring 2004 ‹#›


Oblique Decision Trees

Test condition: x + y < 1 → Class = + ; otherwise → Class = −

• Test condition may involve multiple attributes
• More expressive representation
• Finding the optimal test condition is computationally expensive

© Vipin Kumar CSci 5980 Spring 2004 ‹#›


Model Evaluation

Metrics for Performance Evaluation


– How to evaluate the performance of a model?

Methods for Performance Evaluation


– How to obtain reliable estimates?

Methods for Model Comparison


– How to compare the relative performance
among competing models?

© Vipin Kumar CSci 5980 Spring 2004 ‹#›


Metrics for Performance Evaluation

Focus on the predictive capability of a model
– rather than how fast it classifies or builds models,
scalability, etc.

Confusion Matrix:

                    PREDICTED CLASS
                    Class=Yes   Class=No
ACTUAL   Class=Yes  a (TP)      b (FN)
CLASS    Class=No   c (FP)      d (TN)

a: TP (true positive)     b: FN (false negative)
c: FP (false positive)    d: TN (true negative)

© Vipin Kumar CSci 5980 Spring 2004 ‹#›


Metrics for Performance Evaluation…

                    PREDICTED CLASS
                    Class=Yes   Class=No
ACTUAL   Class=Yes  a (TP)      b (FN)
CLASS    Class=No   c (FP)      d (TN)

Most widely used metric:

Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)

© Vipin Kumar CSci 5980 Spring 2004 ‹#›


Limitation of Accuracy

Consider a 2-class problem


– Number of Class 0 examples = 9990
– Number of Class 1 examples = 10

If model predicts everything to be class 0,


accuracy is 9990/10000 = 99.9 %
– Accuracy is misleading because model does
not detect any class 1 example

© Vipin Kumar CSci 5980 Spring 2004 ‹#›


Cost Matrix

                    PREDICTED CLASS
C(i|j)              Class=Yes     Class=No
ACTUAL   Class=Yes  C(Yes|Yes)    C(No|Yes)
CLASS    Class=No   C(Yes|No)     C(No|No)

C(i|j): cost of misclassifying a class j example as class i

© Vipin Kumar CSci 5980 Spring 2004 ‹#›


Computing Cost of Classification

Cost Matrix:
                 PREDICTED CLASS
C(i|j)           +      −
ACTUAL   +       −1     100
CLASS    −       1      0

Model M1:                     Model M2:
          PREDICTED                     PREDICTED
           +     −                       +     −
ACTUAL +   150   40           ACTUAL +   250   45
CLASS  −   60    250          CLASS  −   5     200

Accuracy = 80%                Accuracy = 90%
Cost = 3910                   Cost = 4255
© Vipin Kumar CSci 5980 Spring 2004 ‹#›
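Verifying the numbers above (illustrative code, not from the slides): total cost sums each confusion-matrix count times its cost-matrix entry.

def total_cost(confusion, cost):
    return sum(confusion[i][j] * cost[i][j] for i in range(2) for j in range(2))

cost = [[-1, 100], [1, 0]]       # rows: actual +/-, columns: predicted +/-
m1 = [[150, 40], [60, 250]]
m2 = [[250, 45], [5, 200]]
print(total_cost(m1, cost), total_cost(m2, cost))   # 3910 4255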
Cost vs Accuracy

Accuracy is proportional to cost if
1. C(Yes|No) = C(No|Yes) = q
2. C(Yes|No) and C(No|Yes) cost q; C(Yes|Yes) = C(No|No) = p

Count matrix (as before): a, b, c, d with N = a + b + c + d
Accuracy = (a + d)/N

Cost = p (a + d) + q (b + c)
     = p (a + d) + q (N − a − d)
     = q N − (q − p)(a + d)
     = N [q − (q − p) × Accuracy]

© Vipin Kumar CSci 5980 Spring 2004 ‹#›


Cost-Sensitive Measures
Precision (p) = a / (a + c)
Recall (r) = a / (a + b)
F-measure (F) = 2rp / (r + p) = 2a / (2a + b + c)

Precision is biased towards C(Yes|Yes) & C(Yes|No)
Recall is biased towards C(Yes|Yes) & C(No|Yes)
F-measure is biased towards all except C(No|No)

Weighted Accuracy = (w1 a + w4 d) / (w1 a + w2 b + w3 c + w4 d)

© Vipin Kumar CSci 5980 Spring 2004 ‹#›
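These measures follow directly from the confusion-matrix counts; a small sketch (illustrative, not from the slides), evaluated on model M1's matrix from the cost example:

def cost_sensitive_measures(a, b, c):
    precision = a / (a + c)
    recall = a / (a + b)
    f_measure = 2 * a / (2 * a + b + c)    # harmonic mean of precision and recall
    return precision, recall, f_measure

print(cost_sensitive_measures(150, 40, 60))   # (~0.714, ~0.789, 0.75)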


Model Evaluation

Metrics for Performance Evaluation


– How to evaluate the performance of a model?

Methods for Performance Evaluation


– How to obtain reliable estimates?

Methods for Model Comparison


– How to compare the relative performance
among competing models?

© Vipin Kumar CSci 5980 Spring 2004 ‹#›


Methods for Performance Evaluation

How to obtain a reliable estimate of


performance?

Performance of a model may depend on other


factors besides the learning algorithm:
– Class distribution
– Cost of misclassification
– Size of training and test sets

© Vipin Kumar CSci 5980 Spring 2004 ‹#›


Learning Curve

Learning curve shows how accuracy changes with varying sample size.
Requires a sampling schedule for creating the learning curve:
– Arithmetic sampling (Langley et al.)
– Geometric sampling (Provost et al.)

Effect of small sample size:
– Bias in the estimate
– Variance of the estimate

© Vipin Kumar CSci 5980 Spring 2004 ‹#›


Methods of Estimation
Holdout
– Reserve 2/3 for training and 1/3 for testing
Random subsampling
– Repeated holdout
Cross validation
– Partition data into k disjoint subsets
– k-fold: train on k-1 partitions, test on the remaining one
– Leave-one-out: k=n
Stratified sampling
– oversampling vs undersampling
Bootstrap
– Sampling with replacement
© Vipin Kumar CSci 5980 Spring 2004 ‹#›
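A minimal sketch of k-fold cross-validation index generation (illustrative, not from the slides; real implementations usually shuffle the data first):

def kfold_indices(n, k):
    # partition record indices into k disjoint subsets; each fold serves once
    # as the test set while the remaining k-1 folds form the training set
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

for train, test in kfold_indices(6, 3):
    print("train:", train, "test:", test)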
Model Evaluation

Metrics for Performance Evaluation


– How to evaluate the performance of a model?

Methods for Performance Evaluation


– How to obtain reliable estimates?

Methods for Model Comparison


– How to compare the relative performance
among competing models?

© Vipin Kumar CSci 5980 Spring 2004 ‹#›


Test of Significance

Given two models:


– Model M1: accuracy = 85%, tested on 30 instances
– Model M2: accuracy = 75%, tested on 5000 instances

Can we say M1 is better than M2?


– How much confidence can we place on accuracy of
M1 and M2?
– Can the difference in performance measure be
explained as a result of random fluctuations in the test
set?

© Vipin Kumar CSci 5980 Spring 2004 ‹#›


Confidence Interval for Accuracy

Prediction can be regarded as a Bernoulli trial


– A Bernoulli trial has 2 possible outcomes
– Possible outcomes for prediction: correct or wrong
– Collection of Bernoulli trials has a Binomial distribution:
◆ x ~ Bin(N, p), x: number of correct predictions
◆ e.g.: toss a fair coin 50 times; how many heads would turn up?
Expected number of heads = N·p = 50 × 0.5 = 25

Given x (# of correct predictions) or equivalently,


acc=x/N, and N (# of test instances),

Can we predict p (true accuracy of model)?

© Vipin Kumar CSci 5980 Spring 2004 ‹#›


Confidence Interval for Accuracy
For large test sets (N > 30),
– acc has a normal distribution
with mean p and variance p(1 − p)/N

P( Z_{α/2} ≤ (acc − p) / sqrt( p(1 − p)/N ) ≤ Z_{1−α/2} ) = 1 − α

Confidence interval for p:

p = ( 2·N·acc + Z²_{α/2} ± Z_{α/2} · sqrt( Z²_{α/2} + 4·N·acc − 4·N·acc² ) )
    / ( 2·(N + Z²_{α/2}) )

© Vipin Kumar CSci 5980 Spring 2004 ‹#›


Confidence Interval for Accuracy

Consider a model that produces an accuracy of
80% when evaluated on 100 test instances:
– N = 100, acc = 0.8
– Let 1 − α = 0.95 (95% confidence)
– From the probability table, Z_{α/2} = 1.96

1−α:  0.99   0.98   0.95   0.90
Z:    2.58   2.33   1.96   1.65

N         50     100    500    1000   5000
p(lower)  0.670  0.711  0.763  0.774  0.789
p(upper)  0.888  0.866  0.833  0.824  0.811

© Vipin Kumar CSci 5980 Spring 2004 ‹#›
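The table above can be reproduced with the interval formula (illustrative code, not from the slides):

import math

def accuracy_interval(acc, n, z=1.96):
    # confidence interval for the true accuracy p, given observed acc on n instances
    center = 2 * n * acc + z * z
    spread = z * math.sqrt(z * z + 4 * n * acc - 4 * n * acc * acc)
    denom = 2 * (n + z * z)
    return (center - spread) / denom, (center + spread) / denom

for n in (50, 100, 500, 1000, 5000):
    print(n, accuracy_interval(0.8, n))   # e.g. N=100 -> (~0.711, ~0.867)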


Comparing Performance of 2 Models

Given two models, say M1 and M2, which is better?
– M1 is tested on D1 (size = n1), found error rate = e1
– M2 is tested on D2 (size = n2), found error rate = e2
– Assume D1 and D2 are independent
– If n1 and n2 are sufficiently large, then
  e1 ~ N(μ1, σ1), e2 ~ N(μ2, σ2)
– Approximate: σ̂i² = ei (1 − ei) / ni
© Vipin Kumar CSci 5980 Spring 2004 ‹#›


Comparing Performance of 2 Models

To test if the performance difference is statistically
significant: d = e1 − e2
– d ~ N(d_t, σ_t) where d_t is the true difference
– Since D1 and D2 are independent, their variances add up:

σ_t² = σ1² + σ2² ≈ σ̂1² + σ̂2²
     = e1(1 − e1)/n1 + e2(1 − e2)/n2

– At the (1 − α) confidence level, d_t = d ± Z_{α/2} · σ̂_t

© Vipin Kumar CSci 5980 Spring 2004 ‹#›


An Illustrative Example

Given: M1: n1 = 30, e1 = 0.15
       M2: n2 = 5000, e2 = 0.25
d = |e2 − e1| = 0.1 (2-sided test)

σ̂_d² = 0.15(1 − 0.15)/30 + 0.25(1 − 0.25)/5000 = 0.0043

At the 95% confidence level, Z_{α/2} = 1.96:

d_t = 0.100 ± 1.96 × sqrt(0.0043) = 0.100 ± 0.128

=> The interval contains 0 => the difference may not be
statistically significant
© Vipin Kumar CSci 5980 Spring 2004 ‹#›
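The whole test fits in a few lines (illustrative code, not from the slides):

import math

def difference_interval(e1, n1, e2, n2, z=1.96):
    # variance of d = e1 - e2 under independence of the two test sets
    var = e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2
    d = abs(e2 - e1)
    return d - z * math.sqrt(var), d + z * math.sqrt(var)

lo, hi = difference_interval(0.15, 30, 0.25, 5000)
print(lo, hi)   # interval (~-0.028, ~0.228) contains 0 -> not significant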
